Abstract

Most scripting languages nowadays use regex pattern-matching libraries. 
These regex libraries borrow the syntax of regular expressions, but have an in- 
formal semantics that is different from the semantics of regular expressions, 
removing the commutativity of alternation and adding ad-hoc extensions 
that cannot be expressed by formalisms for efficient recognition of regular 
languages, such as deterministic finite automata. 

Parsing Expression Grammars are a formalism that can describe all de- 
terministic context-free languages and has a simple computational model. 
In this paper, we present a formalization of regexes via transformation to 
Parsing Expression Grammars. The proposed transformation easily accom- 
modates several of the common regex extensions, giving a formal meaning to 
them. It also provides a clear computational model that helps to estimate 
the efficiency of regex-based matchers, and a basis for specifying provably 
correct optimizations for them. 
1. Introduction

Regular expressions are a concise way for describing regular languages 
with an algebraic notation. Their syntax has seen wide use in pattern match- 
ing libraries for programming languages, where they are used to specify the 
pattern against which a user is trying to match a string, or, more commonly, 
the pattern that a user is searching for in a string. 

Regular expressions used for pattern matching are known as regexes jH, 
[ij, and while they look like regular expressions they often have different 
semantics, based on how the pattern matching libraries that use them are 
actually implemented. 

A simple example that shows this semantic difference is the pair of regular
expressions a | aa and aa | a, which both describe the language {a, aa}. It is
trivial to prove that the | operator of regular expressions is commutative given
its common semantics as the union of sets. But the regexes a | aa and aa | a
behave differently for some implementations of pattern matching libraries
and some subjects.

The standard regex libraries of the Perl and Ruby languages, as well as 
PCRE jsj, a regex library with bindings for many programming languages, 
give different results when matching these two regexes against the subject 
aa. In all three libraries the first regex matches just the first a of the subject, 
while the second regex matches the whole subject. With the subject ab both 
regexes give the same answer in all three libraries, matching the first a, but 
we can see that the | operator for regexes is not commutative. 
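The behavior described above can be reproduced with a short experiment. The sketch below uses Python's standard `re` module, which is not one of the three libraries named in the text; it is used here on the assumption that it follows the same backtracking, leftmost-first strategy. The helper name `matched_prefix` is hypothetical.

```python
import re

def matched_prefix(pattern, subject):
    """Return the prefix of subject matched by pattern, or None on failure."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

# Same regular language {a, aa}, but different results on the subject "aa".
print(matched_prefix(r"a|aa", "aa"))  # a
print(matched_prefix(r"aa|a", "aa"))  # aa

# On the subject "ab" both orderings agree.
print(matched_prefix(r"a|aa", "ab"))  # a
print(matched_prefix(r"aa|a", "ab"))  # a
```

Swapping the alternatives changes the matched prefix only when both alternatives can match at the same position, which is exactly the case that the set-based semantics cannot distinguish.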

This behavior of regexes is directly linked to the way they are usually im- 
plemented, by trying the alternatives in a | expression in the order they ap- 
pear and backtracking when a particular path through the expression makes 
the match fail. 

A naive implementation of regex matching via backtracking can have 
exponential worst-case running time, which implementations try to avoid 
through ad-hoc optimizations to cut the amount of backtracking that needs 
to be done for common patterns. These ad-hoc optimizations lead to im- 
plementations not having a cost model of their operation, which makes it 
difficult for users to determine the performance of regex patterns. Simple 
modifications can make the time complexity of a pattern go from linear to 
exponential in unpredictable ways 0, [sj . 

Regexes can also have syntactical and semantical extensions that are dif- 
ficult, or even impossible, to express through pure regular expressions. These 
extensions do not have a formal model, but are informally specified through
how they modify the behavior of an implementation based on backtrack- 
ing. The meaning of regex patterns that use the extensions may vary among 
different regex libraries jof, or even among different implementations of the 
same regex library [3]. 

Practical regex libraries try to solve performance problems with ad-hoc 
optimizations for common patterns, but this makes the implementation of a 
regex library a complex task, and is another source of unpredictable perfor- 
mance, as different implementations can have different performance charac- 
teristics. 

A heavily optimized regex engine, RE2 jsj, uses an implementation based 
on finite automata and guarantees linear time performance, but it relies on 
ad-hoc optimizations to handle more complex patterns, as a naive automata-based
implementation can have quadratic behavior [9]. More importantly, it
cannot implement some common regex extensions [8].

Parsing Expression Grammars (PEGs) are a formalism that can ex- 
press all deterministic context-free languages, which means that PEGs can 
also express all regular languages. The syntax of PEGs is based on the syn- 
tax of regular expressions and regexes, and PEGs have a formal semantics 
based on ordered choice, a controlled form of backtracking that, like the | 
operation of regexes, is sensitive to the ordering of the alternatives. 

We believe that ordered choice makes PEGs a suitable base for a formal 
treatment of regexes, and show, in this paper, that we can describe the 
meaning of regex patterns by conversion to PEGs. Moreover, PEGs can 
be efficiently executed by a parsing machine that has a clear cost model 
that we can use to reason about the time complexity of matching a given
pattern [11, 12]. We can then use the semantics of PEGs to reason about the
behavior of regexes, for example, to optimize pattern matching and searching 
by avoiding excessive backtracking. We believe that the combination of the 
regex to PEG conversion and the PEG parsing machine can be used to build 
implementations of regex libraries that are simpler and easier to extend than 
current ones. 

The main contribution of this paper is our formalization of a simple, 
structure-preserving translation from plain regular expressions to PEGs that 
can be used to translate regexes to PEGs that match the same subjects. We 
present a formalization of regular expressions as patterns that match prefixes 
of strings instead of sets of strings, using the framework of natural seman- 
tics [13]. In this semantics, regular expressions are just a non-deterministic 
form of the regexes used for pattern matching. We show that our seman-
tics is equivalent to the standard set-based semantics when we consider the 
language of a pattern as the set of prefixes that it matches. 

We then present a formalization of PEGs in the same style, and use it 
to show the similarities and differences between regular expressions, regexes, 
and PEGs. We then define a transformation that converts a regular expres- 
sion to a PEG, and prove its correctness. We also show how we can improve 
the transformation for some classes of regexes by exploiting their properties 
and the greater predictability and control of performance that PEGs have, 
improving the performance of the resulting PEGs. Finally, we show how 
our transformation can be adapted to accommodate four regex extensions 
that cannot be expressed by regular expressions: independent expressions, 
possessive and lazy repetition, and lookahead. 

There are procedures for transforming deterministic finite automata and
right-linear grammars to PEGs [11, 14] and, as there are transformations from
regular expressions to these formalisms, we could have used these existing
procedures as the basis of an implementation of regular expressions in PEG
engines. But the transformations of regular expressions to these formalisms
cover just a subset of regexes, and exclude common extensions such as
those covered in Section 6 of this paper. The direct transformation we present
here is straightforward and can cover regex extensions.

In the next section, we present our formalizations of regular expressions 
and PEGs, and discuss when a regular expression has the same meaning when 
interpreted as a regular expression and as a PEG, along with the intuition 
behind our transformation. In Section 3, we formalize our transformation
from regular expressions to PEGs and prove its correctness. In Section 4
we show how we can reason about the performance of PEGs to improve the
PEGs generated by our transformation in some cases. In Section 5 we show
how our approach compares to existing regex implementations with some
benchmarks. In Section 6 we show how our transformation can accommodate
some regex extensions. Finally, in Section 7 we discuss some related work
and present our conclusions.

2. Regular Expressions and PEGs 

Given a finite alphabet \(T\), we can define a regular expression \(e\) inductively
as follows, where \(a \in T\), and both \(e_1\) and \(e_2\) are also regular expressions:

\[
e \;=\; \varepsilon \;\mid\; a \;\mid\; e_1\, e_2 \;\mid\; e_1 \,|\, e_2 \;\mid\; e_1^*
\]

Traditionally, a regular expression can also be \(\emptyset\), but we will not consider
it; \(\emptyset\) is not used in regexes, and any expression with \(\emptyset\) as a subexpression can
be either rewritten without \(\emptyset\) or is equal to \(\emptyset\).

Note that this definition gives an abstract syntax for expressions, and 
this abstract syntax is what we use in the formal semantics and proofs. 
In our examples we use a concrete syntax that assumes that juxtaposition 
(concatenation) is left-associative and has higher precedence than |, which
is also left-associative, while * has the highest precedence, and we will use
parentheses for grouping when these precedence and associativity rules get 
in the way. 

The language of a regular expression \(e\), \(L(e)\), is traditionally defined
through operations on sets. Intuitively, the languages of \(\varepsilon\) and \(a\) are
singleton sets with the corresponding symbols, the language of \(e_1 e_2\) is given by
concatenating all strings of \(L(e_1)\) with all strings of \(L(e_2)\), the language of
\(e_1 \mid e_2\) is the union of the languages of \(e_1\) and \(e_2\), and the language of \(e_1^*\) is the
Kleene closure of the language of \(e_1\), that is, \(L^* = \bigcup_{i \geq 0} L^i\) where \(L^0 = \{\varepsilon\}\)
and \(L^i = L L^{i-1}\) for \(i > 0\) [15, p. 28].



We are interested in a semantics for regexes, the kind of regular expressions
used for pattern matching and searching, so we will define a matching
relation for regular expressions, \(\stackrel{\mathrm{RE}}{\leadsto}\). Informally, we will have
\(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\) if and
only if the expression \(e\) matches the prefix \(x\) of input string \(xy\).

Formally, we define \(\stackrel{\mathrm{RE}}{\leadsto}\) via natural semantics, using the set of inference
rules in Figure 1. We have \(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\) if and only if we can build a proof tree
for this statement using the inference rules. The rules follow naturally from
the expected behavior of each expression: rule empty.1 says that \(\varepsilon\) matches
itself and does not consume the input; rule char.1 says that a symbol matches
and consumes itself if it is the beginning of the input; rule con.1 says that
a concatenation uses the suffix of the first match as the input for the next;
rules choice.1 and choice.2 say that a choice can match the input using
either option; finally, rules rep.1 and rep.2 say that a repetition can either
match \(\varepsilon\) and not consume the input or match its subexpression and match
the repetition again on the suffix that the subexpression left.
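The inference rules can be read as a non-deterministic matcher. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical tuple encoding of expressions and a function `re_match` that returns the set of every suffix y for which a proof tree exists, so that the non-determinism of choice and repetition shows up as a set with more than one element.

```python
# Regular expressions as tuples (a hypothetical encoding, not from the paper):
#   ("eps",)  ("chr", a)  ("cat", e1, e2)  ("alt", e1, e2)  ("star", e1)

def re_match(e, s):
    """Return the set of all suffixes y such that e matches the prefix x of s = xy."""
    tag = e[0]
    if tag == "eps":                       # empty.1
        return {s}
    if tag == "chr":                       # char.1
        return {s[1:]} if s[:1] == e[1] else set()
    if tag == "cat":                       # con.1
        return {z for y in re_match(e[1], s) for z in re_match(e[2], y)}
    if tag == "alt":                       # choice.1 and choice.2
        return re_match(e[1], s) | re_match(e[2], s)
    if tag == "star":
        result, frontier = {s}, {s}        # rep.1: match nothing
        while frontier:
            step = set()
            for y in frontier:
                for z in re_match(e[1], y):
                    # rep.2 requires a non-empty x, so z must be shorter than y
                    if len(z) < len(y) and z not in result:
                        step.add(z)
            result |= step
            frontier = step
        return result
    raise ValueError(f"unknown expression: {e!r}")

a = ("chr", "a")
print(re_match(("alt", a, ("cat", a, a)), "aa") == {"a", ""})   # True
print(re_match(("alt", ("cat", a, a), a), "aa") == {"a", ""})   # True
```

Both orderings of the choice yield the same set of suffixes, which is the commutativity that the set-based semantics predicts.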

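The following lemma proves that the set of strings that expression \(e\)
matches is the language of \(e\), that is,
\(L(e) = \{x \in T^* \mid \exists y \;\; e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y,\ y \in T^*\}\).

Lemma 1. Given a regular expression \(e\) and a string \(x\), for any string \(y\) we
have that \(x \in L(e)\) if and only if \(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\).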

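\[
\begin{array}{lcc}
\textbf{Empty String} &
\dfrac{}{\varepsilon \;\, x \stackrel{\mathrm{RE}}{\leadsto} x}\ (\textbf{empty.1}) & \\[3ex]
\textbf{Character} &
\dfrac{}{a \;\, ax \stackrel{\mathrm{RE}}{\leadsto} x}\ (\textbf{char.1}) & \\[3ex]
\textbf{Concatenation} &
\dfrac{e_1 \;\, xyz \stackrel{\mathrm{RE}}{\leadsto} yz \qquad e_2 \;\, yz \stackrel{\mathrm{RE}}{\leadsto} z}
      {e_1 e_2 \;\, xyz \stackrel{\mathrm{RE}}{\leadsto} z}\ (\textbf{con.1}) & \\[3ex]
\textbf{Choice} &
\dfrac{e_1 \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y}
      {e_1 \mid e_2 \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y}\ (\textbf{choice.1}) &
\dfrac{e_2 \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y}
      {e_1 \mid e_2 \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y}\ (\textbf{choice.2}) \\[3ex]
\textbf{Repetition} &
\dfrac{}{e^* \;\, x \stackrel{\mathrm{RE}}{\leadsto} x}\ (\textbf{rep.1}) &
\dfrac{e \;\, xyz \stackrel{\mathrm{RE}}{\leadsto} yz \qquad e^* \;\, yz \stackrel{\mathrm{RE}}{\leadsto} z}
      {e^* \;\, xyz \stackrel{\mathrm{RE}}{\leadsto} z},\ x \neq \varepsilon\ (\textbf{rep.2})
\end{array}
\]

Figure 1: Natural semantics of relation \(\stackrel{\mathrm{RE}}{\leadsto}\)

Proof. (\(\Rightarrow\)): By induction on the complexity of the pair \((e, x)\). Given the
pairs \((e_1, x_1)\) and \((e_2, x_2)\), the first pair is more complex than the second
one if and only if either \(e_2\) is a proper subexpression of \(e_1\) or \(e_1 = e_2\) and
\(|x_1| > |x_2|\). The base cases are \((\varepsilon, \varepsilon)\) and \((a, a)\), and their proofs follow
by application of rules empty.1 and char.1, respectively. Cases \((e_1 e_2, x)\)
and \((e_1 \mid e_2, x)\) use a straightforward application of the induction hypothesis
on the subexpressions, followed by application of rule con.1 or one of the
choice rules. Case \((e_1^*, \varepsilon)\) follows directly from rule rep.1, while for case
\((e_1^*, x)\), where \(x \neq \varepsilon\), we know by the definition of the Kleene closure that
\(x \in L^i(e_1)\) with \(i > 0\), where \(L^i(e_1)\) is \(L(e_1)\) concatenated with itself \(i\)
times. This means that we can decompose \(x\) into \(x_1 x_2\), with a non-empty \(x_1\),
where \(x_1 \in L(e_1)\) and \(x_2 \in L^{i-1}(e_1)\). Again by the definition of the Kleene
closure this means that \(x_2 \in L(e_1^*)\). The proof now follows by the induction
hypothesis on \((e_1, x_1)\) and \((e_1^*, x_2)\) and an application of rule rep.2.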

(\(\Leftarrow\)): By induction on the height of the proof tree for \(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\). Most
cases are straightforward; the interesting case is when the proof tree concludes
with rule rep.2. By the induction hypothesis we have that \(x \in L(e_1)\) and
\(y \in L(e_1^*)\). By the definition of the Kleene closure we have that \(y \in L^i(e_1)\)
for some \(i \geq 0\), so
\(xy \in L^{i+1}(e_1)\) and, again by the Kleene closure, \(xy \in L(e_1^*)\), which concludes
the proof. □

Salomaa [lij developed a complete axiom system for regular expressions, 
where any valid equation involving regular expressions can be derived from
the axioms. The axioms of system \(F_1\) are:



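\[
\begin{aligned}
e_1 \mid (e_2 \mid e_3) &= (e_1 \mid e_2) \mid e_3 && (1) \\
e_1 (e_2 e_3) &= (e_1 e_2) e_3 && (2) \\
e_1 \mid e_2 &= e_2 \mid e_1 && (3) \\
e_1 (e_2 \mid e_3) &= e_1 e_2 \mid e_1 e_3 && (4) \\
(e_1 \mid e_2) e_3 &= e_1 e_3 \mid e_2 e_3 && (5) \\
e \mid e &= e && (6) \\
\varepsilon e &= e && (7) \\
\emptyset e &= \emptyset && (8) \\
e \mid \emptyset &= e && (9) \\
e^* &= \varepsilon \mid e^* e && (10) \\
e^* &= (\varepsilon \mid e)^* && (11)
\end{aligned}
\]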



Salomaa's regular expressions do not have the \(\varepsilon\) case; the original axioms
use \(\emptyset^*\), which has the same meaning, as the only possible proof trees for \(\emptyset^*\)
use rep.1. The following lemma shows that these axioms are valid under our
semantics for regular expressions, if we take \(e_1 = e_2\) to mean that \(e_1\) and \(e_2\)
match the same sets of strings.

Lemma 2. For each of the axioms of \(F_1\), if \(l\) is the expression on the left
side and \(r\) is the expression on the right side, we have that
\(l \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\) if and
only if \(r \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\).

Proof. Trivially true for axiom 8, as there are no proof trees for either the
left or right sides of this axiom. For axioms 1 to 7 and for axiom 9 it is
straightforward to use the subtrees of the proof tree of one side to build a
proof tree for the other side. We can prove the validity of axiom 11 by a
straightforward induction on the height of the proof trees for each side.

For axiom 10, we need to prove the identity \(a^* a = a a^*\), by induction on
the heights of the proof trees. From this identity the left side of axiom 10
follows by taking the subtrees of rep.2 and combining them with con.1 into
a tree for \(a a^*\), which means we have a tree for \(a^* a\) that we can use to build
a tree for the right side using choice.2. The right side follows from getting
a tree for \(a a^*\) from a tree for \(a^* a\) using the identity, then taking its subtrees
and using rep.2 to get a tree for \(a^*\). □
Parsing expression grammars borrow the syntax of regular expressions.
A parsing expression is also defined inductively, extending the inductive
definition of regular expressions with two new cases, \(A\) for a non-terminal, and
\(!e\) for a not-predicate of expression \(e\). A PEG \(G\) is a tuple \((V, T, P, p_S)\) where
\(V\) is the set of non-terminals, \(T\) is the alphabet (set of terminals), \(P\) is a
function from \(V\) to parsing expressions, and \(p_S\) is the parsing expression that
the PEG matches (its starting parsing expression). We will use the notation
\(G[p]\) for a grammar derived from \(G\) where \(p_S\) is replaced by \(p\) while keeping
\(V\), \(T\), and \(P\) the same. We will refer to both regular expressions and parsing
expressions as just expressions, letting context disambiguate between both
kinds.

While the syntax of parsing expressions is similar to the syntax of regular
expressions, the behavior of the choice and repetition operators is very
different. Choice in PEGs is ordered; a PEG will only try to match the right
side of a choice if the left side cannot match any prefix of the input.
Repetition in PEGs is possessive; a repetition will always consume as much of
the input as it can match, regardless of whether this leads to failure or a
shorter match for the whole pattern¹. To formally define ordered choice and
possessive repetition we also need a way to express that an expression does
not match a prefix of the input, so we need to introduce fail as a possible
outcome of a match.

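Figure 2 gives the definition of \(\stackrel{\mathrm{PEG}}{\leadsto}\), the matching relation for PEGs. As
with regular expressions, we say that \(G \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y\) to express that the grammar
\(G\) matches the prefix \(x\) of input string \(xy\), and the set of strings that a PEG
matches is its language:
\(L(G) = \{x \in T^* \mid \exists y \;\; G \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y,\ y \in T^*\}\).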

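We mark with a * the rules that have been changed from Figure 1 and
mark with a + the rules that were added. Unmarked rules are unchanged
from Figure 1 except for the trivial change of adding the parameter \(G\) to
the relation. We have six new rules, and two changed rules. New rules
char.2 and char.3 generate fail in the case that the expression cannot
match the symbol in the beginning of the input. New rule con.2 says that a
concatenation fails if its left side fails. New rule var.1 says that to match a
non-terminal we have to match the parsing expression associated with it in
this grammar's function from non-terminals to parsing expressions (the
non-terminal's "right-hand side" in the grammar). New rules not.1 and not.2
say that a not predicate never consumes input, but fails if its subexpression
matches a prefix of the input.

¹ Possessive repetition is a consequence of ordered choice, as \(e^*\) is the same as expression
\(A\) where \(A\) is a fresh non-terminal and \(P(A) = eA \mid \varepsilon\).

\[
\begin{array}{l}
\textbf{Empty String}\quad
\dfrac{}{G[\varepsilon] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} x}\ (\textbf{empty.1})
\qquad
\textbf{Non-terminal}\quad
\dfrac{G[P(A)] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} X}
      {G[A] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} X}\ (\textbf{var.1})^{+} \\[3ex]
\textbf{Terminal}\quad
\dfrac{}{G[a] \;\, ax \stackrel{\mathrm{PEG}}{\leadsto} x}\ (\textbf{char.1})
\qquad
\dfrac{}{G[b] \;\, ax \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}},\ b \neq a\ (\textbf{char.2})^{+}
\qquad
\dfrac{}{G[a] \;\, \varepsilon \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}\ (\textbf{char.3})^{+} \\[3ex]
\textbf{Concatenation}\quad
\dfrac{G[p_1] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y \qquad G[p_2] \;\, y \stackrel{\mathrm{PEG}}{\leadsto} X}
      {G[p_1 p_2] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} X}\ (\textbf{con.1})
\qquad
\dfrac{G[p_1] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}
      {G[p_1 p_2] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}\ (\textbf{con.2})^{+} \\[3ex]
\textbf{Ordered Choice}\quad
\dfrac{G[p_1] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y}
      {G[p_1 \mid p_2] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y}\ (\textbf{choice.1})
\qquad
\dfrac{G[p_1] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail} \qquad G[p_2] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} X}
      {G[p_1 \mid p_2] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} X}\ (\textbf{choice.2})^{*} \\[3ex]
\textbf{Repetition}\quad
\dfrac{G[p] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}
      {G[p^*] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} x}\ (\textbf{rep.1})^{*}
\qquad
\dfrac{G[p] \;\, xyz \stackrel{\mathrm{PEG}}{\leadsto} yz \qquad G[p^*] \;\, yz \stackrel{\mathrm{PEG}}{\leadsto} z}
      {G[p^*] \;\, xyz \stackrel{\mathrm{PEG}}{\leadsto} z}\ (\textbf{rep.2}) \\[3ex]
\textbf{Not Predicate}\quad
\dfrac{G[p] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}
      {G[!p] \;\, x \stackrel{\mathrm{PEG}}{\leadsto} x}\ (\textbf{not.1})^{+}
\qquad
\dfrac{G[p] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y}
      {G[!p] \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} \mathtt{fail}}\ (\textbf{not.2})^{+}
\end{array}
\]

Figure 2: Definition of relation \(\stackrel{\mathrm{PEG}}{\leadsto}\) through natural semantics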

The change in rule con.1 is trivial and only serves to propagate fail, so
we do not consider it an actual change. The changes to rules choice.2 and
rep.1 are what actually implements ordered choice and possessive repetition,
respectively. Rule choice.2 says that we can only match the right side of
the choice if the left side fails, while rule rep.1 says that a repetition only
stops if we try to match its subexpression and fail.
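Because the PEG rules are deterministic, they translate directly into a recursive matcher that returns a single result. The sketch below is an illustration under assumptions, not the parsing machine mentioned in the introduction: the tuple encoding, the name `peg_match`, and the use of `None` for fail are all hypothetical, and a repetition whose body matches the empty string is rejected explicitly because such a PEG is not complete.

```python
# Parsing expressions as tuples (a hypothetical encoding, not from the paper):
#   ("eps",) ("chr", a) ("cat", p1, p2) ("alt", p1, p2) ("star", p)
#   ("var", name) ("not", p)
# A grammar P is a dict from non-terminal names to parsing expressions.

FAIL = None

def peg_match(P, p, s):
    """Return the suffix left after matching p against s, or FAIL."""
    tag = p[0]
    if tag == "eps":                                   # empty.1
        return s
    if tag == "var":                                   # var.1
        return peg_match(P, P[p[1]], s)
    if tag == "chr":                                   # char.1, char.2, char.3
        return s[1:] if s[:1] == p[1] else FAIL
    if tag == "cat":                                   # con.1, con.2
        y = peg_match(P, p[1], s)
        return FAIL if y is FAIL else peg_match(P, p[2], y)
    if tag == "alt":                                   # choice.1, choice.2
        y = peg_match(P, p[1], s)
        return y if y is not FAIL else peg_match(P, p[2], s)
    if tag == "star":                                  # rep.1, rep.2
        while True:
            y = peg_match(P, p[1], s)
            if y is FAIL:
                return s                               # stop only when the body fails
            if len(y) == len(s):
                raise ValueError("repetition body matched the empty string")
            s = y
    if tag == "not":                                   # not.1, not.2
        return s if peg_match(P, p[1], s) is FAIL else FAIL
    raise ValueError(f"unknown expression: {p!r}")

a, b = ("chr", "a"), ("chr", "b")
print(peg_match({}, ("alt", a, ("cat", a, b)), "ab"))         # b
print(peg_match({}, ("cat", ("star", b), b), "bb"))           # None
```

The second call fails because the possessive repetition consumes both b's and leaves nothing for the trailing b, which is the behavior that rule rep.1 encodes. The empty suffix is a valid result, so results are compared against `FAIL` with `is` rather than tested for truthiness.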

It is easy to see that PEGs are deterministic; that is, a given PEG \(G\) can
only have a single result (either fail or a suffix of \(x\)) for some input \(x\), and
only a single proof tree for this result. If the PEG \(G\) always yields a result
for any input in \(T^*\) then we say that \(G\) is complete [10]. PEGs that are not
complete include any PEG that has left recursion and PEGs with repetitions
\(e^*\) where \(e\) matches the empty string. From now on we will assume that any
PEG we consider is complete unless stated otherwise. The completeness of
a PEG can be proved syntactically [10].



The syntax of the expressions that form a PEG is a superset of the
syntax of regular expressions, so syntactically any regular expression \(e\) has a
corresponding PEG \(G_e = (V, T, P, e)\), where \(V\) and \(P\) can be anything. We
can prove that \(L(G_e) \subseteq L(e)\) by a simple induction on the height of proof
trees for \(G_e \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y\), but it is easy to show examples where \(L(G_e)\) is a
proper subset of \(L(e)\), so the regular expression and its corresponding PEG
have different languages.

For example, expression a | ab has the language {a, ab} as a regular
expression but {a} as a PEG, because on an input with prefix ab the left side
of the choice always matches and keeps the right side from being tried. The
same happens with expression a (b | bb), which has language {ab, abb} as a
regular expression and {ab} as a PEG, and on inputs with prefix abb the left
side of the choice keeps the right side from matching.

The behavior of the PEGs actually matches the behavior of the regexes a | ab
and a (b | bb) on Perl-compatible regex engines. These engines will always
match a | ab with just the first a on subjects starting with ab, and always
match a (b | bb) with ab on subjects starting with abb, unless the order of the
alternatives is reversed.

A different situation happens with expression (a | aa) b. As a regular 
expression, its language is {ab, aab} while as a PEG it is {ab}, but the 
PEG fails on an input with prefix aab. Regex engines will backtrack and 
try the second alternative when b fails to match the second a, and will also 
match aab, highlighting the difference between the unrestricted backtracking 
of common regex implementations and PEGs' restricted backtracking. 
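The three regex behaviors discussed above can be checked with a few lines of code. This sketch uses Python's `re` module as a stand-in for a Perl-compatible engine, which is an assumption, since the text does not name Python; the helper name `regex_prefix` is hypothetical.

```python
import re

def regex_prefix(pattern, subject):
    """Return the prefix of subject that the regex matches, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

# Ordered alternatives: the first alternative that lets the match succeed wins.
print(regex_prefix(r"a|ab", "ab"))       # a
print(regex_prefix(r"a(b|bb)", "abb"))   # ab

# Unrestricted backtracking: after b fails against the second a, the engine
# retries the choice with its second alternative and matches the whole subject.
print(regex_prefix(r"(a|aa)b", "aab"))   # aab
```

The first two results agree with the PEG reading of the same expressions, while the third is the case where the PEG fails and the regex engine succeeds.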

If we take the previous regular expression, (a | aa) b, and distribute b over
the two alternatives, we have ab | aab. This expression now has the same
language when we interpret it either as a regular expression, as a PEG, or
as a regex.

If we change a | ab and a (b | bb) so they have the prefix property² by adding
an end-of-input marker $, we have the expressions (a | ab) $ and a (b | bb) $.
Now their languages as regular expressions are {a$, ab$} and {ab$, abb$},
respectively, but the first expression fails as a PEG on an input with prefix
ab and the second expression fails as a PEG on an input with prefix abb.

Both (a | ab) $ and a (b | bb) $ match, as regexes, the same set of strings that
form their languages as regular expressions. They are in the form \((e_1 \mid e_2)\, e_3\),
like (a | aa) b, so we can distribute \(e_3\) over the choice to obtain \(e_1 e_3 \mid e_2 e_3\). If
we do that the two expressions become a$ | ab$ and a (b$ | bb$), respectively,
and they now have the same language either as a regular expression, as a
PEG, or as a regex.

² There are no distinct strings x and y in the language such that x is a prefix of y.

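We will say that a PEG \(G\) and a regular expression \(e\) over the same
alphabet \(T\) are equivalent if the following conditions hold for every input
string \(xy\):

\[
\begin{aligned}
G \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y \;&\Rightarrow\; e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y && (12) \\
e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y \;&\Rightarrow\; G \;\, xy \stackrel{\mathrm{PEG}}{\not\leadsto} \mathtt{fail} && (13)
\end{aligned}
\]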

That is, a PEG \(G\) and a regular expression \(e\) are equivalent if \(L(G) \subseteq L(e)\)
and \(G\) does not fail for any string with a prefix in \(L(e)\). In the examples
above, regular expressions a | ab, a (b | bb), a$ | ab$, a (b$ | bb$), and ab | aab
are all equivalent with their corresponding PEGs, while (a | ab) $, a (b | bb) $,
and (a | aa) b are not.
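Conditions (12) and (13) can be tested by brute force on short inputs. The sketch below is a bounded check under assumptions, not a proof of equivalence: it reads the same expression once as a regular expression and once as its corresponding PEG, and compares the two on every string up to a chosen length. The encoding and the names `re_match`, `peg_match`, `equivalent_up_to`, and `word` are hypothetical.

```python
from itertools import product

# Expressions as tuples (hypothetical encoding):
#   ("eps",)  ("chr", a)  ("cat", e1, e2)  ("alt", e1, e2)  ("star", e1)

def re_match(e, s):
    """All suffixes y such that e, read as a regular expression, matches x in s = xy."""
    tag = e[0]
    if tag == "eps":
        return {s}
    if tag == "chr":
        return {s[1:]} if s[:1] == e[1] else set()
    if tag == "cat":
        return {z for y in re_match(e[1], s) for z in re_match(e[2], y)}
    if tag == "alt":
        return re_match(e[1], s) | re_match(e[2], s)
    # star: suffixes reachable by repeated non-empty matches of the body
    seen, frontier = {s}, {s}
    while frontier:
        frontier = {z for y in frontier for z in re_match(e[1], y)
                    if len(z) < len(y)} - seen
        seen |= frontier
    return seen

def peg_match(e, s):
    """The single suffix left when e is read as a PEG, or None for fail."""
    tag = e[0]
    if tag == "eps":
        return s
    if tag == "chr":
        return s[1:] if s[:1] == e[1] else None
    if tag == "cat":
        y = peg_match(e[1], s)
        return None if y is None else peg_match(e[2], y)
    if tag == "alt":
        y = peg_match(e[1], s)
        return y if y is not None else peg_match(e[2], s)
    # star: possessive; also stops if the body makes no progress, to terminate
    while True:
        y = peg_match(e[1], s)
        if y is None or len(y) == len(s):
            return s
        s = y

def equivalent_up_to(e, alphabet, max_len):
    """Check conditions (12) and (13) on every input of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            re_suffixes = re_match(e, s)
            peg_suffix = peg_match(e, s)
            if peg_suffix is not None and peg_suffix not in re_suffixes:
                return False              # condition (12) violated
            if re_suffixes and peg_suffix is None:
                return False              # condition (13) violated
    return True

def word(w):
    """Concatenation of the characters of a non-empty string w."""
    e = ("chr", w[-1])
    for ch in reversed(w[:-1]):
        e = ("cat", ("chr", ch), e)
    return e

a_or_ab = ("alt", word("a"), word("ab"))
print(equivalent_up_to(a_or_ab, "ab$", 4))                        # True
print(equivalent_up_to(("cat", a_or_ab, word("$")), "ab$", 4))    # False
```

A `True` result only says that no counterexample exists up to the given length; a `False` result is conclusive, since it exhibits an input on which one of the two conditions fails.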

Informally, a PEG and a regular expression will be equivalent if the PEG 
matches the same strings as the regular expression when the regular expres- 
sion is viewed as a regex under the common "leftmost-first" semantics of 
Perl-compatible regex implementations. If a regular expression can match 
different prefixes of the same subject, due to the non-determinism of the 
choice and repetition operations, the two conditions of equivalence guaran- 
tee that an equivalent PEG will match one of those prefixes. 

Regexes are deterministic, and will also match just a single prefix of 
the possible prefixes a regular expression can match. Our transformation 
will preserve the ordering among choices, so the prefix an equivalent PEG 
obtained with our transformation matches will be the same prefix a regex 
matches. 

While equivalence is enough to guarantee that a PEG will give the same 
results as a regex, equivalence together with the prefix property yields the 
following lemma: 

Lemma 3. If a regular expression \(e\) with the prefix property and a PEG \(G\)
are equivalent then \(L(G) = L(e)\).

Proof. As our first condition of equivalence says that \(L(G) \subseteq L(e)\), we just
need to prove that \(L(e) \subseteq L(G)\). Suppose there is a string \(x \in L(e)\); this
means that \(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\) for any \(y\). But from equivalence this means that
\(G \;\, xy \stackrel{\mathrm{PEG}}{\not\leadsto} \mathtt{fail}\). As \(G\) is complete, we have
\(G \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y'\). By equivalence, the
prefix of \(xy\) that \(G\) matches is in \(L(e)\). Now \(y\) cannot be a proper suffix of
\(y'\) nor \(y'\) a proper suffix of \(y\), or the prefix property would be violated. This
means that \(y' = y\), and \(x \in L(G)\), completing the proof. □

We can now present an overview of how we will transform a regular
expression \(e\) into an equivalent PEG. We first need to transform subexpressions
of the form \((e_1 \mid e_2)\, e_3\) to \(e_1 e_3 \mid e_2 e_3\). We do not need to actually duplicate
\(e_3\) in both sides of the choice, potentially causing an explosion in the size
of the resulting expression, but can introduce a fresh non-terminal \(X\) with
\(P(X) = e_3\), and distribute \(X\) to transform \((e_1 \mid e_2)\, e_3\) into \(e_1 X \mid e_2 X\).

Transforming repetition is trickier, but we just have to remember that
\(e_1^* e_2 = (e_1 e_1^* \mid \varepsilon)\, e_2 = (e_1 e_1^* e_2) \mid e_2\). Naively transforming the first expression
to the third does not work, as we end up with \(e_1^* e_2\) in the expression again,
but we can add a fresh non-terminal \(A\) to the PEG with \(P(A) = e_1 A \mid e_2\)
and then replace \(e_1^* e_2\) with \(A\) in the original expression. The end result of
repeatedly applying these two transformation steps until we reach a fixed
point will be a parsing expression that is equivalent to the original regular
expression.

As an example, let us consider the regular expression b*b$. Its language
is {b$, bb$, ...}, but when interpreted as a PEG the language is \(\emptyset\), due to
possessive repetition. If we transform the original expression into a PEG
with starting parsing expression A and P(A) = bA | b$, it will have the same
language as the original regular expression; for example, given the input
bb$, this PEG matches the first b through subexpression b of bA, and then
A tries to match the rest of the input, b$. So, once more subexpression b of
bA matches b and then A tries to match the rest of the input, $. Since both
bA and b$ fail to match $, A fails, and thus bA fails for input b$. Now we
try b$, which successfully matches b$, and the complete match succeeds.
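The trace above can be followed step by step in a hand-written recursive function for the non-terminal A. This is a minimal sketch of the single production P(A) = bA | b$; the function name `match_A` and the use of `None` for fail are assumptions made for illustration.

```python
def match_A(s):
    """PEG non-terminal A with P(A) = bA | b$; returns the remaining suffix or None."""
    # First alternative: b A
    if s.startswith("b"):
        rest = match_A(s[1:])
        if rest is not None:
            return rest
    # Second alternative, tried only after the first one has failed: b $
    if s.startswith("b$"):
        return s[2:]
    return None

print(match_A("bb$") == "")   # True: the whole input is matched
print(match_A("bb"))          # None: no end-of-input marker, so A fails
```

On bb$ the innermost call receives $ and fails, which makes the first alternative fail one level up and lets b$ match there, mirroring the backtracking described in the text.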

If we now consider the regular expression b*b, which has {b, bb, ...} as its
language, we have that it also has the empty set as its language if we interpret
it as a PEG. A PEG with starting parsing expression A and P(A) = bA | b
also has {b, bb, ...} as its language, but with an important difference: the
regular expression can match just the first b of a subject starting with bb, but
the PEG will match both, and any other ones that follow, so we do not have
\(e \;\, xy \stackrel{\mathrm{RE}}{\leadsto} y\) implying \(G \;\, xy \stackrel{\mathrm{PEG}}{\leadsto} y\). But the behavior of the PEG corresponds to the
behavior of regex engines, which use greedy repetition, where a repetition will
always match as much as it can while not making the rest of the expression
fail.

\[
\begin{aligned}
\Pi(\varepsilon, G_k) &= G_k \\
\Pi(a, G_k) &= G_k[a\, p_k] \\
\Pi(e_1 e_2, G_k) &= \Pi(e_1, \Pi(e_2, G_k)) \\
\Pi(e_1 \mid e_2, G_k) &= G_2[p_1 \mid p_2], \text{ where } G_2 = (V_2, T, P_2, p_2) = \Pi(e_2, (V_1, T, P_1, p_k)) \\
&\qquad \text{and } (V_1, T, P_1, p_1) = \Pi(e_1, G_k) \\
\Pi(e_1^*, G_k) &= G, \text{ where } G = (V_1, T, P_1 \cup \{A \to p_1 \mid p_k\}, A) \text{ with } A \notin V_k \text{ and} \\
&\qquad (V_1, T, P_1, p_1) = \Pi(e_1, (V_k \cup \{A\}, T, P_k, A))
\end{aligned}
\]

Figure 3: Definition of function \(\Pi\), where \(G_k = (V_k, T, P_k, p_k)\)

The next section formalizes our transformation, and proves that for any 
regular expression e it will give a PEG that is equivalent to e, that is, it 
will yield a PEG that recognizes the same language as e if it has the prefix 
property. 

3. Transforming Regular Expressions to PEGs 

This section presents function \(\Pi\), a formalization of the transformation
we outlined in the previous section. The function \(\Pi\) transforms a regular
expression \(e\) using a PEG \(G_k\) that is equivalent to a regular expression \(e_k\) to
yield a PEG that is equivalent to the regular expression \(e\, e_k\).

The intuition behind \(\Pi\) is that \(G_k\) is a continuation for the regular
expression \(e\), being what should be matched after matching \(e\). We use this
continuation when transforming choices and repetitions to do the transfor- 
mations of the previous section; for a choice, the continuation is distributed 
to both sides of the choice. For a repetition, it is used as the right side for the 
new non-terminal, and the left side of this non-terminal is the transformation 
of the repetition's subexpression with the non-terminal as continuation. 

For a concatenation, the transformation is the result of transforming the
right side with \(G_k\) as continuation, then using this as continuation for
transforming the left side. This lets the transformation of \((e_1 \mid e_2)\, e_3\) work as
expected: we transform \(e_3\) and then use the PEG as the continuation that
we distribute over the choice.

We can transform a standalone regular expression \(e\) by passing a PEG
with \(\varepsilon\) as starting expression as the continuation; this gives us a PEG that
is equivalent to the regular expression \(e\, \varepsilon\), or \(e\).
Figure 3 has the definition of function \(\Pi\). Notice how repetition introduces
a new non-terminal, and the transformation of choice has to take this into
account by using the set of non-terminals and the productions of the result of
transforming one side to transform the other side, so there will be no overlap.
Also notice how we transform a repetition by transforming its body using
the repetition itself as a continuation (through the introduction of a fresh
non-terminal), then building a choice between the transformation of the
body and the continuation of the repetition. The transformation process is
bottom-up and right-to-left.
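The definition of the transformation can be sketched as a short recursive function. The code below is an illustration under assumptions rather than the authors' implementation: expressions use a hypothetical tuple encoding, a grammar is reduced to a pair of a production dict and a starting expression, fresh non-terminals are named N0, N1, and so on, and the initial continuation is assumed to have no productions. A small PEG matcher is included only so the result can be run.

```python
# Expressions as tuples (hypothetical encoding):
#   ("eps",) ("chr", a) ("cat", e1, e2) ("alt", e1, e2) ("star", e1) ("var", name)
# A grammar is a pair (P, p): productions dict and starting parsing expression.

def pi(e, grammar):
    """Transform regular expression e using grammar (P, pk) as its continuation."""
    P, pk = grammar
    tag = e[0]
    if tag == "eps":
        return P, pk
    if tag == "chr":
        return P, ("cat", e, pk)
    if tag == "cat":
        # right side first, then use the result as continuation for the left side
        return pi(e[1], pi(e[2], grammar))
    if tag == "alt":
        # thread the productions through both sides so fresh names never overlap
        P1, p1 = pi(e[1], (P, pk))
        P2, p2 = pi(e[2], (P1, pk))
        return P2, ("alt", p1, p2)
    if tag == "star":
        name = f"N{len(P)}"                 # fresh, assuming all names come from pi
        reserved = dict(P)
        reserved[name] = None               # reserve the name before recursing
        P1, p1 = pi(e[1], (reserved, ("var", name)))
        P1 = dict(P1)
        P1[name] = ("alt", p1, pk)          # A -> p1 | pk
        return P1, ("var", name)
    raise ValueError(f"unknown expression: {e!r}")

def peg_match(P, p, s):
    """Return the suffix left after matching p against s, or None for fail."""
    tag = p[0]
    if tag == "eps":
        return s
    if tag == "var":
        return peg_match(P, P[p[1]], s)
    if tag == "chr":
        return s[1:] if s[:1] == p[1] else None
    if tag == "cat":
        y = peg_match(P, p[1], s)
        return None if y is None else peg_match(P, p[2], y)
    if tag == "alt":
        y = peg_match(P, p[1], s)
        return y if y is not None else peg_match(P, p[2], s)
    raise ValueError(f"unknown expression: {p!r}")

EMPTY = ({}, ("eps",))                      # continuation equivalent to the empty string
a, b = ("chr", "a"), ("chr", "b")

P, start = pi(("cat", ("star", b), b), EMPTY)            # b*b
print(peg_match(P, start, "bbb") == "")                   # True

P, start = pi(("cat", ("alt", a, ("cat", a, a)), b), EMPTY)   # (a | aa) b
print(peg_match(P, start, "aab") == "")                   # True
```

Both expressions fail on these inputs when read directly as PEGs, and succeed after the transformation, because the continuation is pushed inside the repetition and distributed over the choice.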

We will show the subtler points of transformation \(\Pi\) with some examples.
In the following discussion, we use the alphabet \(T = \{a, b, c\}\), and the
continuation grammar \(G_k = (\emptyset, T, \emptyset, \varepsilon)\) that is equivalent to the regular
expression \(\varepsilon\). In our first example, we use the regular expression (a | b | c)* a (a | b | c)*,
which matches an input that has at least one a.

We first transform the second repetition by evaluating \(\Pi((a \mid b \mid c)^*, G_k)\);
we first transform a | b | c with a new non-terminal A as continuation, yielding
the PEG aA | bA | cA, then combine it with \(\varepsilon\) to yield the PEG A where A
has the production below:

\[
A \to aA \mid bA \mid cA \mid \varepsilon
\]

Next is the concatenation with a, yielding the PEG aA. We then use this
PEG as continuation for transforming the first repetition. This transformation
uses a new non-terminal B as a continuation for transforming a | b | c,
yielding aB | bB | cB, then combines it with aA to yield the PEG B with the
productions below:

\[
B \to aB \mid bB \mid cB \mid aA \qquad A \to aA \mid bA \mid cA \mid \varepsilon
\]

When the original regular expression matches a given input, we do not
know how many a's the first repetition matches, because the semantics of
regular expressions is non-deterministic. Regex implementations commonly
resolve ambiguities in repetitions by the longest match rule, where the first
repetition will match all but the last a of the input. PEGs are deterministic
by construction, and the PEG generated by \(\Pi\) obeys the longest match rule.
The alternative aA of non-terminal B will only be tried if all the other
alternatives fail, which happens at the end of the input. The PEG then backtracks until
the last a is found, where it matches the last a and proceeds with non-terminal
A.

The regular expression $(b \mid c)^* a (a \mid b \mid c)^*$ defines the same
language as the regular expression of the first example, but without the
ambiguity. Now $\Pi$ with continuation $G_k$ yields the following PEG $B$:

$$B \rightarrow bB \,/\, cB \,/\, aA \qquad A \rightarrow aA \,/\, bA \,/\, cA \,/\, \varepsilon$$

Although the productions of this PEG and the previous one match the
same strings, the second PEG is more efficient, as it will not have to reach
the end of the input and then backtrack until finding the last $a$. This is
an example of how we can use our semantics and the transformation $\Pi$ to
reason about the behavior of a regex. The relative efficiency of the two PEGs
is an artifact of the semantics, while the relative efficiency of the two
regexes depends on how a particular engine is implemented. In a backtracking
implementation it will depend on what ad-hoc optimizations the implementation
makes. In an automata implementation both regexes will have the same
efficiency, at the expense of the implementation lacking the expressive power
of some regex extensions.
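The difference between the two PEGs can be observed with a very small backtracking interpreter. The sketch below is only an illustration: the tuple encoding of PEG expressions, the helper names (`seq`, `alt`, `match`, `run`) and the "steps" counter are assumptions made here, not part of the formal semantics. It encodes the two grammars derived above and counts how many interpreter steps each one needs.

```python
# PEG expressions are nested tuples (an encoding assumed for this sketch):
#   ("eps",)  ("chr", c)  ("var", name)  ("seq", p, q)  ("alt", p, q)
EPS = ("eps",)

def ch(c):
    return ("chr", c)

def var(name):
    return ("var", name)

def seq(*ps):
    """Right-nested concatenation p1 p2 ... pn."""
    return ps[0] if len(ps) == 1 else ("seq", ps[0], seq(*ps[1:]))

def alt(*ps):
    """Right-nested ordered choice p1 / p2 / ... / pn."""
    return ps[0] if len(ps) == 1 else ("alt", ps[0], alt(*ps[1:]))

def match(grammar, p, s, i, stats):
    """Match p against s starting at index i; return the end index or None."""
    stats["steps"] += 1
    kind = p[0]
    if kind == "eps":
        return i
    if kind == "chr":
        return i + 1 if i < len(s) and s[i] == p[1] else None
    if kind == "var":
        return match(grammar, grammar[p[1]], s, i, stats)
    if kind == "seq":
        j = match(grammar, p[1], s, i, stats)
        return None if j is None else match(grammar, p[2], s, j, stats)
    # Ordered choice: the second alternative is tried only if the first fails.
    j = match(grammar, p[1], s, i, stats)
    return j if j is not None else match(grammar, p[2], s, i, stats)

A_RULE = alt(seq(ch("a"), var("A")), seq(ch("b"), var("A")),
             seq(ch("c"), var("A")), EPS)

# PEG obtained from (a | b | c)* a (a | b | c)*
G1 = {"A": A_RULE,
      "B": alt(seq(ch("a"), var("B")), seq(ch("b"), var("B")),
               seq(ch("c"), var("B")), seq(ch("a"), var("A")))}

# PEG obtained from (b | c)* a (a | b | c)*
G2 = {"A": A_RULE,
      "B": alt(seq(ch("b"), var("B")), seq(ch("c"), var("B")),
               seq(ch("a"), var("A")))}

def run(grammar, s):
    """Match start symbol B; return (end index or None, interpreter steps)."""
    stats = {"steps": 0}
    end = match(grammar, var("B"), s, 0, stats)
    return end, stats["steps"]

if __name__ == "__main__":
    for subject in ["bcbcabcb", "abab", "bcb"]:
        print(subject, run(G1, subject), run(G2, subject))
```

Both grammars accept and reject the same subjects, but the first one has to run to the end of the input and backtrack to the last $a$, so its step count is higher. The recursion depth grows with the subject length, so this sketch is only suitable for short inputs.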

The expressions in the two previous examples are well-formed. A regular
expression $e$ is well-formed if it does not have a subexpression $e_i^*$
where $\varepsilon \in L(e_i)$. If $e$ is a well-formed regular expression
and $G_k$ is a complete PEG then $\Pi(e, G_k)$ is also complete. In
Section 3.1 we will show how to mechanically obtain a well-formed regular
expression that recognizes the same language as a non-well-formed regular
expression while preserving its overall structure.

We will now prove that our transformation $\Pi$ is correct, that is, if $e$
is a well-formed regular expression and $G_k$ is a PEG equivalent to a
regular expression $e_k$ then $\Pi(e, G_k)$ is equivalent to $e\,e_k$. The
proofs use a small technical lemma: each production of PEG $G_k$ is also in
PEG $\Pi(e, G_k)$, for any regular expression $e$. This lemma is
straightforward to prove by structural induction on $e$.

We will prove each property necessary for equivalence separately;
equivalence will then be a direct corollary of those two proofs. To prove
the first property we need an auxiliary lemma that states that the
continuation grammar is indeed a continuation, that is, if the PEG
$\Pi(e, G_k)$ matches a prefix $x$ of a given input $xy$ then we can split
$x$ into $v$ and $w$ with $x = vw$ and $G_k$ matching $w$.

Lemma 4. Given a well-formed regular expression $e$, a PEG $G_k$, and an
input string $xy$, if $\Pi(e, G_k)\ xy \stackrel{\text{PEG}}{\leadsto} y$
then there is a suffix $w$ of $x$ such that
$G_k\ wy \stackrel{\text{PEG}}{\leadsto} y$.

Proof. By induction on the complexity of the pair $(e, xy)$. The interesting
case is $e^*$. In this case $\Pi(e^*, G_k)$ gives us a grammar
$G = (V_1, T, P, A)$, where $A \rightarrow p_1 \,/\, p_k$. By var.1 we know
that $G[p_1 \,/\, p_k]\ xy \stackrel{\text{PEG}}{\leadsto} y$. There are
now two subcases to consider, choice.1 and choice.2.

For subcase choice.2, we have
$G[p_k]\ xy \stackrel{\text{PEG}}{\leadsto} y$. But then we have that
$G_k[p_k]\ xy \stackrel{\text{PEG}}{\leadsto} y$ because any non-terminal
that $p_k$ uses to match $xy$ is in both $G$ and $G_k$ and has the same
production in both. The string $x$ is a suffix of itself, and $p_k$ is the
starting expression of $G_k$, closing this part of the proof.

For subcase choice.1 we have
$\Pi(e, \Pi(e^*, G_k))\ xy \stackrel{\text{PEG}}{\leadsto} y$, and by the
induction hypothesis $\Pi(e^*, G_k)\ wy \stackrel{\text{PEG}}{\leadsto} y$
for a suffix $w$ of $x$. We can now use the induction hypothesis again, on
the length of the input, as $w$ must be a proper suffix of $x$. We conclude
that $G_k\ w'y \stackrel{\text{PEG}}{\leadsto} y$ for a suffix $w'$ of $w$,
and so a suffix of $x$, ending the proof. □

The following lemma proves that if the first property of equivalence holds
between a regular expression $e_k$ and a PEG $G_k$ then it will hold for
$e\,e_k$ and $\Pi(e, G_k)$ given a regular expression $e$.

Lemma 5. Given two well-formed regular expressions $e$ and $e_k$ and a PEG
$G_k$, where $G_k\ wy \stackrel{\text{PEG}}{\leadsto} y \Rightarrow
e_k\ wy \stackrel{\text{RE}}{\leadsto} y$, if
$\Pi(e, G_k)\ vwy \stackrel{\text{PEG}}{\leadsto} y$ then
$e\,e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$.

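Proof. By induction on the complexity of the pair $(e, vwy)$. The interesting
case is $e^*$. In this case, $\Pi(e^*, G_k)$ gives us a PEG
$G = (V_1, T, P, A)$, where $A \rightarrow p_1 \,/\, p_k$. By var.1 we know
that $G[p_1 \,/\, p_k]\ vwy \stackrel{\text{PEG}}{\leadsto} y$. There are now
two subcases, choice.1 and choice.2 of $\stackrel{\text{PEG}}{\leadsto}$.

For subcase choice.2, we can conclude that
$G_k\ vwy \stackrel{\text{PEG}}{\leadsto} y$ because $p_k$ is the starting
expression of $G_k$ and any non-terminals it uses have the same production
both in $G$ and $G_k$. We now have
$e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$. By choice.2 of
$\stackrel{\text{RE}}{\leadsto}$ we have
$e\,e^* e_k \mid e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$, but
$e\,e^* e_k \mid e_k \equiv e^* e_k$, so
$e^* e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$, ending this part of the
proof.

For subcase choice.1, we have
$\Pi(e, \Pi(e^*, G_k))\ vwy \stackrel{\text{PEG}}{\leadsto} y$, and by
Lemma 4 we have $\Pi(e^*, G_k)\ wy \stackrel{\text{PEG}}{\leadsto} y$. The
string $v$ is not empty, so we can use the induction hypothesis and Lemma 4
again to conclude $e^* e_k\ wy \stackrel{\text{RE}}{\leadsto} y$. Then we use
the induction hypothesis on
$\Pi(e, \Pi(e^*, G_k))\ vwy \stackrel{\text{PEG}}{\leadsto} y$ to conclude
$e\,e^* e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$. We can now use rule
choice.1 of $\stackrel{\text{RE}}{\leadsto}$ to get
$e\,e^* e_k \mid e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$, but
$e\,e^* e_k \mid e_k \equiv e^* e_k$, so
$e^* e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$, ending the proof. □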

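The following lemma proves that if the second property of equivalence
holds between a regular expression $e_k$ and a PEG $G_k$ then it will hold
for $e\,e_k$ and $\Pi(e, G_k)$ given a regular expression $e$.

Lemma 6. Given well-formed regular expressions $e$ and $e_k$ and a PEG
$G_k$, where Lemma 5 holds and we have
$e_k\ wy \stackrel{\text{RE}}{\not\leadsto} y \Rightarrow
G_k\ wy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$, if
$e\,e_k\ vwy \stackrel{\text{RE}}{\not\leadsto} y$ then
$\Pi(e, G_k)\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$.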

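Proof. By induction on the complexity of the pair $(e, vwy)$. The interesting
case is $e^*$. We will use again the equivalence
$e^* e_k \equiv e\,e^* e_k \mid e_k$. There are two subcases, choice.1 and
choice.2 of $\stackrel{\text{RE}}{\leadsto}$.

For subcase choice.1, we have that $e$ matches a prefix of $vwy$ by rule
con.1. As $e^*$ is well-formed this prefix is not empty, so
$e^* e_k\ v'wy \stackrel{\text{RE}}{\not\leadsto} y$ for a proper suffix
$v'$ of $v$. By the induction hypothesis we have
$\Pi(e^*, G_k)\ v'wy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$, and by
the induction hypothesis again we get
$\Pi(e, \Pi(e^*, G_k))\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$.
This PEG is complete, so we can conclude
$\Pi(e^*, G_k)[p_1 \,/\, p_k]\ vwy \stackrel{\text{PEG}}{\leadsto}
\mathtt{fail}$ using rule choice.1 of $\stackrel{\text{PEG}}{\leadsto}$ and
then $\Pi(e^*, G_k)\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$ by
rule var.1, ending this part of the proof.

For subcase choice.2, we can assume that there is no proof tree for the
statement $e\,e^* e_k\ vwy \stackrel{\text{RE}}{\leadsto} y$, or we could
reduce this subcase to the first one by using choice.1 instead of choice.2.
Because $\Pi(e, \Pi(e^*, G_k))$ is complete we can use modus tollens of
Lemma 5 to conclude that
$\Pi(e, \Pi(e^*, G_k))\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$.
We also have $e_k\ vwy \stackrel{\text{RE}}{\not\leadsto} y$, so
$G_k\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$. Now we can use rule
choice.2 of $\stackrel{\text{PEG}}{\leadsto}$ to conclude
$G[p_1 \,/\, p_k]\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$, and
then $\Pi(e^*, G_k)\ vwy \stackrel{\text{PEG}}{\leadsto} \mathtt{fail}$ by
rule var.1, ending the proof. □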

The correctness lemma for $\Pi$ is a corollary of the two previous lemmas:

Lemma 7. Given well-formed regular expressions $e$ and $e_k$ and a PEG $G_k$,
where $e_k$ and $G_k$ are equivalent, then $\Pi(e, G_k)$ and $e\,e_k$ are
equivalent.

Proof. The proof that the first property of equivalence holds for
$\Pi(e, G_k)$ and $e\,e_k$ follows from the first property of equivalence for
$e_k$ and $G_k$ plus Lemma 5. The proof that the second property of
equivalence holds follows from the first property of equivalence for
$\Pi(e, G_k)$ and $e\,e_k$, the second property of equivalence for $e_k$ and
$G_k$, plus Lemma 6. □



A corollary of the previous lemma combined with Lemma 3 is that
$L(e\,\$) = L(\Pi(e, \$))$, proving that our transformation can yield a PEG
that recognizes the same language as any well-formed regular expression $e$
just by using an end-of-input marker, even if the language of $e$ does not
have the prefix property.

It is interesting to see whether the axioms of system $F_1$ (presented on
page 7) are still valid if we transform both sides using $\Pi$ with
$\varepsilon$ as the continuation PEG, that is, if $l$ is the left side of
the equation and $r$ is the right side then
$\Pi(l, \varepsilon)\ xy \stackrel{\text{PEG}}{\leadsto} y$ if and only if
$\Pi(r, \varepsilon)\ xy \stackrel{\text{PEG}}{\leadsto} y$. This is
straightforward for axioms 1, 2, 4, 6, 7, 8, and 9; in fact, it is easy to
prove that these axioms will be valid for any PEG, not just PEGs obtained
from our transformation.

Applying $\Pi$ to both sides of axiom 2, $e_1(e_2 e_3)$ and $(e_1 e_2)e_3$,
makes them identical; they both become $\Pi(e_1, \Pi(e_2, \Pi(e_3, G_k)))$.
The same thing happens with axiom 5,
$(e_1 \mid e_2)e_3 = e_1 e_3 \mid e_2 e_3$; the transformation of the left
side, $\Pi((e_1 \mid e_2)e_3, G_k)$, becomes
$\Pi(e_1, \Pi(e_3, G_k)) \,/\, \Pi(e_2, \Pi(e_3, G_k))$ via the intermediate
expression $\Pi(e_1 \mid e_2, \Pi(e_3, G_k))$, while the transformation of
the right side, $\Pi(e_1 e_3 \mid e_2 e_3, G_k)$, also becomes
$\Pi(e_1, \Pi(e_3, G_k)) \,/\, \Pi(e_2, \Pi(e_3, G_k))$, although via the
intermediate expression $\Pi(e_1 e_3, G_k) \,/\, \Pi(e_2 e_3, G_k)$.

The transformation of axiom 3, $e_1 \mid e_2 = e_2 \mid e_1$, will not be
valid; the left side becomes the PEG $\Pi(e_1, G_k) \,/\, \Pi(e_2, G_k)$ and
the right side becomes the PEG $\Pi(e_2, G_k) \,/\, \Pi(e_1, G_k)$, but
ordered choice is not commutative in the general case. One case where this
choice is commutative is if the language of $e_1 \mid e_2$ has the prefix
property. We can use an argument analogous to the
argument of Lemma E] to prove this, which is not surprising, as this lemma 
together with Lemma |2] implies that this axiom should hold for expressions 
with languages that have the prefix property. 

Axiom 10, $e^* = \varepsilon \mid e^* e$, needs to be rewritten as
$e^* = e^* e \mid \varepsilon$ or it is trivially not valid, as the right
side will always match just $\varepsilon$. Again, this is not surprising, as
$\varepsilon \mid e^* e$ does not have the prefix property, and this is the
same behavior as regex implementations. Rewriting the axiom as
$e^* = e^* e \mid \varepsilon$ makes it valid when we apply $\Pi$ to both
sides, as long as $e^*$ is well-formed.

The left side becomes the PEG $A$ where
$A \rightarrow \Pi(e, A) \,/\, \varepsilon$, while the right side becomes
$B \,/\, \varepsilon$ where
$B \rightarrow \Pi(e, B) \,/\, \Pi(e, \varepsilon)$. If $\Pi(e, A)$ fails it
means that $\Pi(e, \varepsilon)$ would also fail, and so does $\Pi(e, B)$;
then $B$ fails and $B \,/\, \varepsilon$ succeeds by choice.2. Analogous
reasoning holds for the other side, if $B$ fails. If $\Pi(e, A)$ succeeds
then $\Pi(e, \varepsilon)$ matches a non-empty prefix, and $A$ matches the
rest, and we can assume that $B \,/\, \varepsilon$ matches this rest by
induction on the length of the matched string. We can use this to conclude
that $B \,/\, \varepsilon$ also succeeds. Again, analogous reasoning holds
for the converse.

The right side of axiom 11, $(\varepsilon \mid e)^*$, is not well-formed, and
applying $\Pi$ to it would lead to a left-recursive PEG with no possible
proof tree. We still need to show that any regular expression can be made
well-formed without changing its language. This is the topic of the next
section, where we give a transformation that rewrites non-well-formed
repetitions so they are well-formed with minimal changes to the structure of
the original regular expression. Applying this transformation to the right
side of axiom 11 will make it identical to the left side, making the axiom
trivially valid.

3.1. Transformation of Repetitions $e^*$ where $\varepsilon \in L(e)$

A regular expression $e$ that has a subexpression $e_i^*$ where $e_i$ can
match the empty string is not well-formed. As $e_i$ can succeed without
consuming any input, one outcome of $e_i^*$ is to stay in the same place of
the input indefinitely. Regex libraries that rely on backtracking may enter
an infinite loop with non-well-formed expressions unless they take measures
to avoid it, using ad-hoc rules to detect and break the resulting infinite
loops [17].

When $e$ is not well-formed, the PEG we obtain through transformation $\Pi$
is not complete. A PEG that is not complete can make a PEG library enter
an infinite loop. To show an example of how a non-well-formed regular
expression leads to a PEG that is not complete, let us transform
$(a \mid \varepsilon)^* b$ using $\Pi$. Using $\varepsilon$ as continuation
yields the following PEG $A$:

$$A \rightarrow aA \,/\, A \,/\, b$$

The PEG above is left recursive, so it is not complete. In fact, this PEG
does not have a proof tree for any input, so it is not equivalent to the
regular expression $(a \mid \varepsilon)^* b$.
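The left-recursive production can be traced back to the individual steps of the transformation. The derivation below is a sketch that assumes the rules for concatenation, repetition, choice, and $\varepsilon$ as they are described informally in this section, with $A$ as the fresh non-terminal introduced for the repetition.

```latex
\begin{align*}
\Pi((a \mid \varepsilon)^* b,\ G_k)
  &= \Pi((a \mid \varepsilon)^*,\ \Pi(b, G_k))
  && \text{concatenation; the start expression of } \Pi(b, G_k) \text{ is } b \\
\text{production of } A:\quad
A &\rightarrow \Pi(a \mid \varepsilon,\ A) \,/\, b
  && \text{repetition: body with continuation } A \text{, then the continuation } b \\
\Pi(a \mid \varepsilon,\ A)
  &= \Pi(a, A) \,/\, \Pi(\varepsilon, A)
  && \text{choice} \\
  &= aA \,/\, A
  && \Pi(a, A) = aA,\quad \Pi(\varepsilon, A) = A \\
\text{hence}\quad
A &\rightarrow aA \,/\, A \,/\, b
\end{align*}
```

Whenever the alternative $aA$ fails, the second alternative invokes $A$ again at the same input position, which is why no finite proof tree can be built.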

Transformation $\Pi$ is not correct for non-well-formed regular expressions,
but we can make any non-well-formed regular expression well-formed by
rewriting repetitions $e_i^*$ where $\varepsilon \in L(e_i)$ as $e_i'^*$
where $\varepsilon \notin L(e_i')$ and $L(e_i'^*) = L(e_i^*)$. The regular
expression above would become $a^* b$, which $\Pi$ transforms into an
equivalent complete PEG.

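This section presents a transformation that mechanically performs this
rewriting. We use a pair of functions to rewrite an expression, $f_{out}$ and
$f_{in}$.

$$\begin{array}{lcl}
\mathit{empty}(\varepsilon) &=& \mathit{true} \\
\mathit{empty}(a) &=& \mathit{false} \\
\mathit{empty}(e^*) &=& \mathit{empty}(e) \\
\mathit{empty}(e_1 e_2) &=& \mathit{empty}(e_1) \wedge \mathit{empty}(e_2) \\
\mathit{empty}(e_1 \mid e_2) &=& \mathit{empty}(e_1) \wedge \mathit{empty}(e_2) \\[4pt]
\mathit{null}(\varepsilon) &=& \mathit{true} \\
\mathit{null}(a) &=& \mathit{false} \\
\mathit{null}(e^*) &=& \mathit{true} \\
\mathit{null}(e_1 e_2) &=& \mathit{null}(e_1) \wedge \mathit{null}(e_2) \\
\mathit{null}(e_1 \mid e_2) &=& \mathit{null}(e_1) \vee \mathit{null}(e_2)
\end{array}$$

Figure 4: Definition of predicates empty and null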

Function $f_{out}$ recursively searches for a repetition that has
$\varepsilon$ in the language of its subexpression, while $f_{in}$ rewrites
the repetition's subexpression so it is well-formed, does not have
$\varepsilon$ in its language, and does not change the language of the
repetition. Both $f_{in}$ and $f_{out}$ use two auxiliary predicates,
empty and null, that respectively test if an expression is equal to
$\varepsilon$ (if its language is the singleton set $\{\varepsilon\}$) and if
an expression has $\varepsilon$ in its language. Figure 4 has inductive
definitions for the empty and null predicates.

Function $f_{out}$ is simple: for the base expressions it is the identity,
and for the composite expressions $f_{out}$ applies itself recursively to
subexpressions unless the expression is a repetition where the repetition's
subexpression matches $\varepsilon$. In this case $f_{out}$ transforms the
repetition to $\varepsilon$ if the subexpression is equal to $\varepsilon$
(as $\varepsilon^* \equiv \varepsilon$), or uses $f_{in}$ to rewrite the
subexpression. Figure 5 has the inductive definition of $f_{out}$. It obeys
the following lemma:

Lemma 8. If $f_{in}(e_k)$ is well-formed, $\varepsilon \notin L(f_{in}(e_k))$,
and $L(f_{in}(e_k)^*) = L(e_k^*)$ for any $e_k$ with
$\varepsilon \in L(e_k)$ and $L(e_k) \neq \{\varepsilon\}$ then, for any $e$,
$f_{out}(e)$ is well-formed and $L(e) = L(f_{out}(e))$.

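Proof. By structural induction on $e$. Inductive cases follow directly from
the induction hypothesis, except for $e^*$ where $\varepsilon \in L(e)$,
where it follows from the properties of $f_{in}$. □

$$\begin{array}{lcl}
f_{out}(e) &=& e, \text{ if } e = \varepsilon \text{ or } e = a \\
f_{out}(e_1 e_2) &=& f_{out}(e_1)\, f_{out}(e_2) \\
f_{out}(e_1 \mid e_2) &=& f_{out}(e_1) \mid f_{out}(e_2) \\
f_{out}(e^*) &=& \begin{cases}
  f_{out}(e)^* & \text{if } \neg \mathit{null}(e) \\
  \varepsilon & \text{if } \mathit{empty}(e) \\
  f_{in}(e)^* & \text{otherwise}
\end{cases}
\end{array}$$

Figure 5: Definition of Function $f_{out}$

$$\begin{array}{lcl}
f_{in}(e_1 e_2) &=& f_{in}(e_1 \mid e_2) \\
f_{in}(e_1 \mid e_2) &=& \begin{cases}
  f_{in}(e_2) & \text{if } \mathit{empty}(e_1) \text{ and } \mathit{null}(e_2) \\
  f_{out}(e_2) & \text{if } \mathit{empty}(e_1) \text{ and } \neg \mathit{null}(e_2) \\
  f_{in}(e_1) & \text{if } \mathit{null}(e_1) \text{ and } \mathit{empty}(e_2) \\
  f_{out}(e_1) & \text{if } \neg \mathit{null}(e_1) \text{ and } \mathit{empty}(e_2) \\
  f_{out}(e_1) \mid f_{in}(e_2) & \text{if } \neg \mathit{null}(e_1) \text{ and } \neg \mathit{empty}(e_2) \\
  f_{in}(e_1) \mid f_{out}(e_2) & \text{if } \neg \mathit{empty}(e_1) \text{ and } \neg \mathit{null}(e_2) \\
  f_{in}(e_1) \mid f_{in}(e_2) & \text{otherwise}
\end{cases} \\
f_{in}(e^*) &=& \begin{cases}
  f_{in}(e) & \text{if } \mathit{null}(e) \\
  f_{out}(e) & \text{otherwise}
\end{cases}
\end{array}$$

Figure 6: Definition of Function $f_{in}(e)$, where
$\neg \mathit{empty}(e)$ and $\mathit{null}(e)$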



Function $f_{in}$ does the heavy lifting of the rewriting; it is used when
$f_{out}$ finds an expression $e^*$ where $\neg \mathit{empty}(e)$ and
$\mathit{null}(e)$. If its argument is a repetition it throws away the
repetition because it is superfluous. Then $f_{in}$ applies $f_{out}$ or
itself to the subexpression depending on whether it matches $\varepsilon$ or
not. If the argument is a choice $f_{in}$ throws away one of the sides if it
is equal to $\varepsilon$, as it is superfluous because of the repetition,
and rewrites the remaining side using $f_{out}$ or $f_{in}$ depending on
whether it matches $\varepsilon$ or not. In case both sides are not equal to
$\varepsilon$, $f_{in}$ rewrites both. If the argument is a concatenation
$f_{in}$ rewrites it as a choice and applies itself to the choice.

Transforming a concatenation into a choice obviously is not a valid
transformation in the general case, but it is safe in the context of
$f_{in}$; $f_{in}$ is working inside a repetition expression, and its
argument has $\varepsilon$ in its language, so we can use an identity
involving languages and the Kleene closure that says
$(AB)^* = (A \cup B)^*$ if $\varepsilon \in A$ and $\varepsilon \in B$.
Figure 6 has the inductive definition of $f_{in}$. It obeys the following
lemma:

Lemma 9. If $f_{out}(e_k)$ is well-formed and $L(f_{out}(e_k)) = L(e_k)$ for
any $e_k$ then, for any $e$ with $\varepsilon \in L(e)$ and
$L(e) \neq \{\varepsilon\}$, $\varepsilon \notin L(f_{in}(e))$,
$L(e^*) = L(f_{in}(e)^*)$, and $f_{in}(e)$ is well-formed.

Proof. By structural induction on $e$. Most cases follow directly from the
induction hypothesis and the properties of $f_{out}$. The subcases of choice
where the result is also a choice use the Kleene closure property
$(A \cup B)^* = (A^* \cup B^*)^*$ together with the induction hypothesis and
the properties of $f_{out}$. Concatenation becomes a choice using the
property mentioned above this lemma. □
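The identity $(AB)^* = (A \cup B)^*$ that justifies turning a concatenation into a choice is standard, and a short argument for it is sketched below for languages $A$ and $B$ that both contain $\varepsilon$.

```latex
\begin{align*}
&\varepsilon \in B \implies A = A\{\varepsilon\} \subseteq AB,
 \qquad
 \varepsilon \in A \implies B = \{\varepsilon\}B \subseteq AB \\
&\text{hence } A \cup B \subseteq AB
 \text{ and therefore } (A \cup B)^* \subseteq (AB)^* \\[4pt]
&AB \subseteq (A \cup B)(A \cup B) \subseteq (A \cup B)^* \\
&\text{hence } (AB)^* \subseteq \big((A \cup B)^*\big)^* = (A \cup B)^*
\end{align*}
```

Both inclusions together give the equality. The first inclusion is the one that needs $\varepsilon$ in both languages, which is why the rewriting is only safe inside $f_{in}$.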

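As an example, let us use $f_{out}$ and $f_{in}$ to rewrite the regular
expression $(bc \mid a^*(d \mid \varepsilon))^*$ into a well-formed regular
expression. We show the sequence of steps below:

$$\begin{array}{rcl}
f_{out}((bc \mid a^*(d \mid \varepsilon))^*)
  &=& (f_{in}(bc \mid a^*(d \mid \varepsilon)))^* \\
  &=& (f_{out}(bc) \mid f_{in}(a^*(d \mid \varepsilon)))^* \\
  &=& (f_{out}(b)\, f_{out}(c) \mid (f_{in}(a^*) \mid f_{in}(d \mid \varepsilon)))^* \\
  &=& (bc \mid (f_{out}(a) \mid f_{out}(d)))^* \\
  &=& (bc \mid (a \mid d))^*
\end{array}$$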

The idea is for rewriting to be automated, and transparent to the user of
regex libraries based on our transformation, unless the user wants to see how
their expression can be simplified. Notice that just the presence of
$\varepsilon$ inside a repetition does not mean that a regular expression is
not well-formed. The expression $(bc \mid a(d \mid \varepsilon))^*$ looks
very similar to the previous one, but is well-formed and left unmodified by
$f_{out}$.
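The rewriting is mechanical enough to be transcribed almost directly into code. The following sketch follows the definitions in Figures 4, 5, and 6; the tuple encoding of regular expressions and the Python names (`empty`, `null`, `f_out`, `f_in`) are choices made for this illustration only.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("eps",)  ("chr", c)  ("cat", e1, e2)  ("alt", e1, e2)  ("star", e)
EPS = ("eps",)

def empty(e):
    """Conservative test for 'e is equal to the empty-string expression'."""
    kind = e[0]
    if kind == "eps":
        return True
    if kind == "chr":
        return False
    if kind == "star":
        return empty(e[1])
    return empty(e[1]) and empty(e[2])          # concatenation and choice

def null(e):
    """True if the empty string is in the language of e."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind == "chr":
        return False
    if kind == "cat":
        return null(e[1]) and null(e[2])
    return null(e[1]) or null(e[2])             # choice

def f_out(e):
    """Look for repetitions whose body can match the empty string."""
    kind = e[0]
    if kind in ("eps", "chr"):
        return e
    if kind in ("cat", "alt"):
        return (kind, f_out(e[1]), f_out(e[2]))
    body = e[1]                                 # e is a repetition
    if not null(body):
        return ("star", f_out(body))
    if empty(body):
        return EPS
    return ("star", f_in(body))

def f_in(e):
    """Rewrite a repetition body; assumes not empty(e) and null(e)."""
    kind = e[0]
    if kind == "cat":                           # safe only inside a repetition
        return f_in(("alt", e[1], e[2]))
    if kind == "star":
        return f_in(e[1]) if null(e[1]) else f_out(e[1])
    e1, e2 = e[1], e[2]                         # e is a choice
    if empty(e1):
        return f_in(e2) if null(e2) else f_out(e2)
    if empty(e2):
        return f_in(e1) if null(e1) else f_out(e1)
    if not null(e1):
        return ("alt", f_out(e1), f_in(e2))
    if not null(e2):
        return ("alt", f_in(e1), f_out(e2))
    return ("alt", f_in(e1), f_in(e2))

a, b, c, d = (("chr", x) for x in "abcd")

# (bc | a*(d | eps))*, the example rewritten in the text
example = ("star", ("alt", ("cat", b, c),
                    ("cat", ("star", a), ("alt", d, EPS))))
# (bc | a(d | eps))*, which is already well-formed
similar = ("star", ("alt", ("cat", b, c), ("cat", a, ("alt", d, EPS))))

if __name__ == "__main__":
    print(f_out(example))
    print(f_out(similar) == similar)
```

The first few tests in `f_in` rely on their order, as in Figure 6: by the time the `empty(e2)` test runs, `empty(e1)` is already known to be false.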

4. Optimizing Search and Repetition 

A common application of regexes is to search for parts of a subject that
match some pattern, but our formal model of regular expressions and PEGs
is anchored, as our matches must start on the first symbol of the subject
instead of starting anywhere. It is easy to build a PEG that will search for
another PEG $(V, T, P, S)$, though: we just need to add a new non-terminal
$S'$ as starting pattern, with $S' \rightarrow S \,/\, .\,S'$ as a new
production, where $.$ is a shortcut for a regular expression that matches any
terminal. If trying to match $S$ from the beginning of the subject fails,
then the PEG skips the first symbol and tries again on the second.

The search pattern works, but can be very inefficient if the PEG engine
always has to use backtracking to implement ordered choice, as advancing
to the correct starting position may involve a large amount of advancing
and then backtracking. A related problem occurs when converting regex
repetition into PEGs, as the PEG generated from the regular expression
$e_1^* e_2$ will greedily try to consume as much of the subject with $e_1$
as possible, then try $e_2$ and backtrack each match of $e_1$ until $e_2$
succeeds or the whole pattern fails. In the rest of this section we will
show how we can use properties of the expressions we are trying to search or
match in conjunction with syntactic predicates to reduce the amount of
backtracking necessary in both searches and repetition expressions.

4.1. Search

The search pattern for a PEG tries to match the PEG then advances
one position in the subject and tries again if the match fails. A simple way
to improve this is to advance several positions if possible, skipping starting
positions that have no chance of a successful match. If we know that a
successful match always consumes part of the subject and begins with a
symbol in a set $\mathcal{F}$ then we can skip a failing starting position
with the pattern $![\mathcal{F}]\ .$, where $[\mathcal{F}]$ is a character
set pattern that matches any symbol in the set. We can skip a string of
failing symbols with the pattern $(![\mathcal{F}]\ .)^*$. The new search
expression for the PEG with starting pattern $p$ can be written as follows:

$$S \rightarrow (![\mathcal{F}]\ .)^* \, (p \,/\, .\,S)$$

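The set $\mathcal{F}$ for a pattern $p$ derived from a regular expression
$e$ is just the FIRST set of $e$, which has a simple definition in terms of
the $\stackrel{\text{RE}}{\leadsto}$ relation below:

$$\mathit{FIRST}(e) = \{a \in T \mid \exists x, y \in T^* .\ e\ axy \stackrel{\text{RE}}{\leadsto} y\}$$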
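To make the search optimization concrete, the sketch below computes FIRST over a small regular expression syntax tree and compares the basic search with the search that skips symbols outside FIRST. Everything here is an assumption made for illustration: the tuple encoding, the function names, and the use of a plain literal-word matcher in place of a PEG produced by $\Pi$. The FIRST computation also assumes that no subexpression denotes the empty language.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("eps",)  ("chr", c)  ("cat", e1, e2)  ("alt", e1, e2)  ("star", e)

def null(e):
    """True if the empty string is in the language of e."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind == "chr":
        return False
    if kind == "cat":
        return null(e[1]) and null(e[2])
    return null(e[1]) or null(e[2])

def first(e):
    """Symbols that can begin a non-empty match of e."""
    kind = e[0]
    if kind == "eps":
        return set()
    if kind == "chr":
        return {e[1]}
    if kind == "star":
        return first(e[1])
    if kind == "alt":
        return first(e[1]) | first(e[2])
    # Concatenation: e2 contributes only if e1 can match the empty string.
    return (first(e[1]) | first(e[2])) if null(e[1]) else first(e[1])

def literal(word):
    """Regular expression matching exactly the given non-empty word."""
    e = ("chr", word[-1])
    for c in reversed(word[:-1]):
        e = ("cat", ("chr", c), e)
    return e

def literal_matcher(word):
    """Anchored matcher standing in for the PEG p: end index or None."""
    return lambda s, i: i + len(word) if s.startswith(word, i) else None

def search_naive(match_at, s):
    """S -> p / . S ; returns (start, end, number of attempts of p)."""
    attempts = 0
    for i in range(len(s) + 1):
        attempts += 1
        end = match_at(s, i)
        if end is not None:
            return i, end, attempts
    return None, None, attempts

def search_first(match_at, first_set, s):
    """S -> (![F] .)* (p / . S), assuming p never matches the empty string."""
    attempts = 0
    i = 0
    while True:
        while i < len(s) and s[i] not in first_set:   # (![F] .)*
            i += 1
        attempts += 1
        end = match_at(s, i)                          # p
        if end is not None:
            return i, end, attempts
        if i >= len(s):                               # '.' fails at end of input
            return None, None, attempts
        i += 1                                        # . S

if __name__ == "__main__":
    subject = "in the beginning god created"
    word = "god"
    print(search_naive(literal_matcher(word), subject))
    print(search_first(literal_matcher(word), first(literal(word)), subject))
```

Both functions report the same match, but the second one only attempts `p` at positions whose symbol belongs to FIRST (and once at the end of the input).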

It is easy to prove that the two search expressions are equivalent by
induction on the height of the corresponding derivation trees. The tricky
case, the one involving $(![\mathcal{F}]\ .)^*$, just uses the definition of
FIRST to build a tree of successive applications of rule ord.2 until we can
use the induction hypothesis in its rightmost leaf.

4.2. Repetition

If two regular expressions $e_1$ and $e_2$ have disjoint FIRST sets then it
is safe to match $e_1^* e_2$ using possessive repetition. This means that we
can transform $e_1^* e_2$ into the PEG $p_1^* p_2$ where $p_1$ and $p_2$ are
the PEGs we get from transforming $e_1$ and $e_2$. Formally, we can define
$\Pi(e_1^* e_2, G_k)$ when
$\mathit{FIRST}(e_1) \cap \mathit{FIRST}(e_2) = \emptyset$ as follows, where
we use $G_2[\varepsilon]$ as the continuation for transforming $e_1$ just to
avoid collisions on the names of non-terminals:

$$\begin{array}{l}
\Pi(e_1^* e_2, G_k) = (V_1, T, P_1, p_1^* p_2) \\
\quad \text{where } (V_1, T, P_1, p_1) = \Pi(e_1, G_2[\varepsilon]) \\
\quad \text{and } G_2 = (V_2, T, P_2, p_2) = \Pi(e_2, G_k)
\end{array}$$

The easiest way to prove the correctness of the new rule is by proving
the equivalence of the PEGs we get from $\Pi(e_1^* e_2, G_k)$ using the old
and new rule. This is a straightforward induction on the height of the proof
trees for these PEGs, using the fact that disjoint FIRST sets for $e_1$ and
$e_2$ imply disjointness of their equivalent PEGs.

In the general case, where the FIRST sets of $e_1$ and $e_2$ are not
disjoint, we can still avoid some amount of backtracking on $e_1^* e_2$ by
being possessive whenever there is no chance of $e_2$ doing a successful
match, as backtracking to a point where $e_2$ cannot match is useless. The
idea is to use a predicated repetition of $p_1$ before doing the choice
$p_1 A \,/\, p_2$ that guarantees that the PEG will backtrack to a point
where $p_2$ matches, if possible. We can use the FIRST set of $e_2$ as an
approximation to the set of possible matches of $e_2$, and the PEG for
$e_1^* e_2$ becomes
$A \rightarrow (![\mathit{FIRST}(e_2)]\ p_1)^* (p_1 A \,/\, p_2)$. The full
rule for $\Pi(e_1^* e_2, G_k)$ becomes as follows:

$$\begin{array}{l}
\Pi(e_1^* e_2, G_k) = (V_1 \cup \{A\},\ T,\ P_1 \cup \{A \rightarrow (![\mathit{FIRST}(e_2)]\ p_1)^* (p_1 A \,/\, p_2)\},\ A) \\
\quad \text{where } (V_1, T, P_1, p_1) = \Pi(e_1, G_2[\varepsilon]) \\
\quad \text{with } A \notin V_1 \text{ and } G_2 = (V_2, T, P_2, p_2) = \Pi(e_2, G_k)
\end{array}$$

Again, the easiest way to prove that this new rule is correct is by proving 
the equivalence of the PEGs obtained from the old and the new rule, by 
induction on the height of the corresponding proof trees. 
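A small worked instance may help to see what the general rule does. The expressions $e_1 = a \mid b$ and $e_2 = bc$ are chosen here only for illustration, with an empty continuation, and the resulting PEG is written with the simplification $p\,\varepsilon = p$.

```latex
\begin{align*}
\mathit{FIRST}(e_1) &= \{a, b\}, \qquad \mathit{FIRST}(e_2) = \{b\}
  && \text{the sets overlap, so } p_1^* p_2 \text{ is not safe} \\
p_1 &= a \,/\, b, \qquad p_2 = bc \\
A &\rightarrow (![b]\ (a \,/\, b))^* \; ((a \,/\, b)\, A \,/\, bc)
\end{align*}
```

On the subject $aabc$ the predicated repetition consumes $aa$ without leaving any point to backtrack to, and stops at the first $b$ because the predicate $![b]$ fails there. The choice then tries $p_1 A$, which consumes $b$ and fails at $c$, and falls back to $p_2 = bc$ at the position of the $b$, which succeeds. The only backtracking happens at a position where $e_2$ can start.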

4.3. Combining Search and Repetition

We can further optimize the case where we want to search for the pattern
$e_1^* e_2$ or $e_1 e_1^* e_2$ (we will use $e_1^+$ as a shorthand for
$e_1 e_1^*$), and all strings in $L(e_1)$ have length one. We can safely skip
the prefix of the subject that matches a possessive repetition of $e_1$
before trying again, because if the pattern would match from any of these
positions then it would not have failed in the first place. We can combine
this with our first search optimization to yield the following search
pattern:

$$S \rightarrow (![\mathcal{F}]\ .)^* \, (p \,/\, p_1^*\, S)$$

In the pattern above, $p$ is the starting expression of a PEG equivalent
to the regular expression we are searching and $p_1$ is the starting
expression of a PEG equivalent to $e_1$ with an empty continuation. Set
$\mathcal{F}$ is still the FIRST set of the whole regular expression. If the
FIRST sets of $e_1$ and $e_2$ are disjoint we can further optimize our search
by breaking up $p$ and using the following search expression to search for
$e_1^+ e_2$:

$$S \rightarrow (![\mathcal{F}]\ .)^* \, p_1^+ \, (p_2 \,/\, S)$$

The special case searching for $e_1^* e_2$ just uses the search expression
$S \rightarrow (![\mathcal{F}]\ .)^* \, p_1^* \, (p_2 \,/\, S)$. Proofs that
these optimizations are correct are straightforward, by proving that these
search expressions are equivalent to
$S \rightarrow (![\mathcal{F}]\ .)^* \, (p \,/\, .\,S)$ by induction on the
height of the derivation trees.



5. Benchmarks 

This section presents some benchmarks of a regex engine that implements our
transformation and executes the resulting PEGs with LPEG, a fast
backtracking PEG engine [11]. We compare this engine with PCRE, a
backtracking regex engine that performs ad-hoc optimizations to cut the
amount of backtracking needed [sl, and with RE2, a non-backtracking
(automata-based) regex engine that nevertheless also incorporates ad-hoc
optimizations jsf.

We tested our search and repetition optimizations with a series of
benchmarks that search for the first successful match of a regular expression
inside a large subject, the Project Gutenberg version of the King James
Bible [18]. Our first benchmark searches for a single literal word in the
subject, and serves as a simple test of the search optimization. Table 1
shows the results.
We can see that the optimization is very effective, as LPEG optimizes the 
repetition in the search pattern to a single instruction of its parsing machine 
that scans the subject checking each character against a bitmap that encodes
the character set. RE2 and PCRE use ad-hoc optimizations to find the string
and are still faster in some of the cases jsf.

Word          RE2   PCRE   Unoptimized   Search    Line
Geshurites      1      1            12        1   19936
worshippeth     3      3            25        4   42140
blotteth        3      3            33        6   60005
sprang          7      9            47       11   80000

Table 1: Time in milliseconds to search for a word

Words              RE2   PCRE   Unopt   Search   Repetition   Line
Adam - Eve           1                                        261
Israel - Samaria     2      2      32        3            3   31144
Jesus - John         2      3      73        6            6   76781
Jesus - Judas        2      4      81        6            6   84614
Jude - Jesus         3      4      94        7            7   98311
Abraham - Jesus      5      5      96        8            8   no match

Table 2: Time in milliseconds to search for two words in the same period

Our second benchmark searches for two literal words in the same period
(separated by letters, spaces or commas). We test the search and repetition
optimizations, but cannot apply the combined optimization of Section 4.3
because the expression does not have the necessary structure. Table 2
shows the results, and we separate the optimizations to show the contribution
of each one to the final result. The runtime is still dominated by having to
find where in the subject the match is, so optimizing the repetition inside
the pattern does not yield any gains. The pattern starts with a literal, so
RE2 and PCRE are using ad-hoc optimizations to find where the match begins.

The third benchmark searches for a literal word that follows any other
word plus a single space (a regular expression [a-zA-Z]+␣w, using character
class notation and ␣ for the space symbol). This pattern falls in
the case where the FIRST sets of the repeated pattern and the pattern
following the repetition are disjoint. We can apply the combined search
and repetition optimization for this pattern, and compare it with the basic
search and repetition optimizations. Table 3 shows the results. Now even
the unoptimized PEG defeats a backtracking regex matcher, but the DFA-based
RE2 is much faster. The FIRST set of the pattern includes most of the
terminals, and the search optimization is not effective. The biggest gain
comes from the combined optimization, as it lets the PEG skip large portions
of the subject in case of failure, yielding a result that is much closer to
RE2.

Word          RE2   PCRE   Unopt   Search   Rep   Combined    Line
Geshurites      5     74      57       59    38          8   19995
worshippeth     7    156     121      126    86         18   42140
blotteth       10    208     159      167   121         24   60005
sprang         12    285     222      227   147         32   80000

Table 3: Time in milliseconds to search for a word following another

Words              RE2   PCRE   Unopt   Search   Rep   Comb   Line
Adam - Eve           2      4       6        6                261
Israel - Samaria     6    504     752      750   126      8   31144
Jesus - John        12   1134    1710     1718   278     18   76781
Jesus - Judas       13   1246    1884     1892   306     20   84614
Jude - Jesus        15   1446    2176     2188   364     24   98311
Abraham - Jesus     15   1470    2214     2220   362     24   no match

Table 4: Time in milliseconds to search for a period containing two words

The fourth and final benchmark extends the second benchmark by bracketing
the pattern with a pattern that matches the part of the period that
precedes and follows the two words we are searching, yielding the pattern
[a-zA-Z,␣]* w1 [a-zA-Z,␣]* w2 [a-zA-Z,␣]*. There is overlap in
the FIRST sets of [a-zA-Z,␣], w1, and w2, so we need to use the more
general form of the repetition optimization. We can also apply the combined
optimization. As in the third benchmark, we compare this optimization with
the basic search and repetition optimizations. Table 4 shows the results. The
effect of the repetition optimization is bigger in this benchmark, but what
brings the performance close to a DFA-based regex matcher, and much better
than a backtracking regex matcher, is still the combined optimization.

Our benchmarks show that without optimizations our PEG-based engine
performs on par with PCRE on more complex patterns. The optimizations
bring it to a factor of 1 to 3 of the performance of RE2, a very efficient
and well-tuned regex implementation that cannot implement common regex
extensions due to its automata-based implementation approach.

$$\begin{array}{rcl}
\Pi(?{>}e_1, G_k) &=& (V_1, T, P_1, p_1 p_k), \text{ where } (V_1, T, P_1, p_1) = \Pi(e_1, G_k[\varepsilon]) \\
\Pi(e_1^{*+}, G_k) &=& \Pi(?{>}e_1^*, G_k) \\
\Pi(e_1^{*?}, G_k) &=& G, \text{ where } G = (V_1, T, P_1 \cup \{A \rightarrow p_k \,/\, p_1\}, A), \\
 && (V_1, T, P_1, p_1) = \Pi(e_1, (V_k \cup \{A\}, T, P_k, A)), \text{ and } A \notin V_k \\
\Pi(?!e_1, G_k) &=& (V_1, T, P_1, !p_1\, p_k), \text{ where } (V_1, T, P_1, p_1) = \Pi(e_1, G_k[\varepsilon]) \\
\Pi(?{=}e_1, G_k) &=& (V_1, T, P_1, !!p_1\, p_k), \text{ where } (V_1, T, P_1, p_1) = \Pi(e_1, G_k[\varepsilon])
\end{array}$$

Figure 7: Adapting Function $\Pi$ to Deal with Regex Extensions



6. Transforming Regex Extensions 

Regexes add several ad-hoc extensions to regular expressions. We can easily
adapt transformation $\Pi$ to deal with some of these extensions, and this
section shows how to use $\Pi$ with independent expressions, possessive
repetitions, lazy repetitions, and lookahead. An informal but broader
discussion of regex extensions in the context of translation to PEGs was
published by Oikawa et al. [19].



The regex $?{>}e_1$ is an independent expression (also known as atomic
grouping). It matches independently of the expression that follows it, so a
failure when matching the expression that follows $?{>}e_1$ does not force a
backtracking regex matcher to backtrack to the alternative matches of
$?{>}e_1$. This is the same behavior as a PEG, so to transform $?{>}e_1$ we
first transform it using an empty continuation, then concatenate the result
with the original continuation.

The regex $e_1^{*+}$ is a possessive repetition. It always matches as much as
possible of the input, even if this leads to a subsequent failure. It is the
same as $?{>}e_1^*$ if the longest-match rule is used. The semantics of
$\Pi$ guarantees longest match, so it uses this identity to transform
$e_1^{*+}$.

The regex $e_1^{*?}$ is a lazy repetition. It always matches as little of the
input as necessary for the rest of the expression to match (shortest match).
The transformation of this regex is very similar to the transformation of
$e_1^*$; we just flip $p_1$ and $p_k$ in the production of non-terminal $A$.
Now the PEG tries to match the rest of the expression first, and will only
try another step of the repetition if the rest fails.

The regex $?!e_1$ is a negative lookahead. The regex matcher tries to match
the subexpression; if it fails then the negative lookahead succeeds without
consuming any input, and if the subexpression succeeds the negative
lookahead fails. Negative lookahead is also an independent expression.
Transforming this regex is just a matter of using the negative lookahead of
PEGs, which works in the same way, on the result of transforming the
subexpression as an independent expression.

Finally, the regex ?=e1 is a positive lookahead. The regex matcher tries
to match the subexpression; the lookahead fails if the subexpression fails
and succeeds if the subexpression succeeds, but it never consumes any input.
It is also an independent expression. We transform a positive lookahead by
transforming the subexpression as an independent expression and then
applying the PEG negative lookahead twice.
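
As a small illustration of why a doubled negative lookahead behaves as a
positive one, the following Python sketch models the PEG not-predicate
directly. The function names (lit, seq, not_, neg_lookahead,
pos_lookahead) are hypothetical, and the body argument stands for a
subexpression that has already been transformed with an empty
continuation.

```python
from typing import Callable, Optional

# A PEG expression: (subject, position) -> end position, or None on failure.
Peg = Callable[[str, int], Optional[int]]

def lit(text: str) -> Peg:
    """PEG that matches the literal text at the current position."""
    return lambda s, i: i + len(text) if s.startswith(text, i) else None

def seq(p: Peg, q: Peg) -> Peg:
    def run(s: str, i: int) -> Optional[int]:
        j = p(s, i)
        return None if j is None else q(s, j)
    return run

def not_(p: Peg) -> Peg:
    """PEG predicate !p: succeeds, consuming nothing, exactly when p fails."""
    return lambda s, i: i if p(s, i) is None else None

def neg_lookahead(body: Peg, k: Peg) -> Peg:
    # ?!e1 followed by continuation k
    return seq(not_(body), k)

def pos_lookahead(body: Peg, k: Peg) -> Peg:
    # ?=e1 followed by k: !!p succeeds iff p does, and consumes nothing
    return seq(not_(not_(body)), k)

with_ab = pos_lookahead(lit("ab"), lit("a"))     # (?=ab)a
without_ab = neg_lookahead(lit("ab"), lit("a"))  # (?!ab)a

print(with_ab("abc", 0))     # 1: the lookahead saw "ab", only 'a' was consumed
print(without_ab("abc", 0))  # None: "ab" is present, so the lookahead fails
print(without_ab("acb", 0))  # 1
```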

None of these extensions has been formalized before, as they depend on 
the behavior of backtracking-based implementations of regexes instead of 
the semantics of regular expressions. We decided to formalize them in terms 
of their conversion to PEGs instead of trying to rework our semantics of 
regular expressions to accommodate them, as these extensions map naturally 
to concepts that are already part of the semantics of PEGs. 

The well-formedness rewriting of Section 3.1 needs to accommodate the
new extensions. It is not possible to rewrite all non-well-formed
expressions with these extensions while keeping their behavior the same, as
these extensions make it possible to write expressions that cannot give a
meaningful result, such as (?>(ε | a))* or {7=a{d \ e))*. Other expressions
can work with some subjects and not work with others, such as
(?>(a | ε | b))* or (?=a(a | ε))*.

Our approach will be to rewrite problematic expressions so they give the
same result for the subjects where they do not cause problems, but also give
a result for other subjects; that is, they will match a superset of the
strings that the original expression matches. For example, the four
expressions above will be respectively rewritten to (?>a)*, a*,
(?>(a | b))*, and a*.

The empty predicate is true for ?!e1 and ?=e1 expressions, and is
empty(e1) for the other extensions. This is a conservative definition, as
expressions such as ?>(ε | e) can also be replaced by ε given the informal
semantics of regexes. The null predicate is true for all the extensions
except ?>e1, where it is null(e1).
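
The two predicates can be written down as a pair of recursive functions.
The following Python sketch assumes a hypothetical tuple encoding of
regexes (tags "eps", "chr", "cat", "alt", "star", "atomic", "poss",
"lazy", "neg" and "pos"); the cases for the plain regex operators are
assumed to be the usual ones and are included only so that the block runs
on its own.

```python
def empty(e: tuple) -> bool:
    """True if e is known to match only the empty string (conservative)."""
    tag = e[0]
    if tag == "eps":
        return True
    if tag == "chr":
        return False
    if tag in ("cat", "alt"):            # assumed plain-regex cases
        return empty(e[1]) and empty(e[2])
    if tag in ("neg", "pos"):            # lookaheads never consume input
        return True
    if tag in ("star", "atomic", "poss", "lazy"):
        return empty(e[1])
    raise ValueError(f"unknown regex node: {tag}")

def null(e: tuple) -> bool:
    """True if e can match the empty string."""
    tag = e[0]
    if tag == "eps":
        return True
    if tag == "chr":
        return False
    if tag == "cat":                     # assumed plain-regex case
        return null(e[1]) and null(e[2])
    if tag == "alt":                     # assumed plain-regex case
        return null(e[1]) or null(e[2])
    if tag == "atomic":                  # the one extension that is not always null
        return null(e[1])
    if tag in ("star", "poss", "lazy", "neg", "pos"):
        return True
    raise ValueError(f"unknown regex node: {tag}")

EPS, A = ("eps",), ("chr", "a")
print(empty(("neg", A)))                     # True
print(null(("atomic", A)))                   # False
print(null(("poss", A)))                     # True
print(empty(("atomic", ("alt", EPS, A))))    # False, although it only ever matches ε
```

The last line shows the conservative side of the definition: the atomic
group always commits to its first alternative, ε, yet empty still reports
False for it.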

Figure 8 gives the definitions of fout and fin for the extensions.
Function fout just applies itself recursively for ?>e1, ?!e1 and ?=e1, but
it needs to rewrite the repetitions using fin if their bodies can match ε.
Function fin applies itself recursively to atomic groupings, keeping them
atomic, but it strips repetitions. A repetition being rewritten by fin is
used directly inside another repetition, so it does not matter if it is
possessive, lazy, or a regular greedy repetition; it is the outer repetition
that will govern how much of the subject will be matched.

\begin{align*}
f_{out}(?{>}e_1) &= ?{>}f_{out}(e_1) \\
f_{out}(e_1^{*+}) &=
  \begin{cases}
    f_{out}(e_1)^{*+} & \text{if } \neg\mathit{null}(e_1) \\
    \varepsilon       & \text{if } \mathit{empty}(e_1) \\
    f_{in}(e_1)^{*+}  & \text{otherwise}
  \end{cases} \\
f_{out}(e_1^{*?}) &=
  \begin{cases}
    f_{out}(e_1)^{*?} & \text{if } \neg\mathit{null}(e_1) \\
    \varepsilon       & \text{if } \mathit{empty}(e_1) \\
    f_{in}(e_1)^{*?}  & \text{otherwise}
  \end{cases} \\
f_{out}(?!e_1) &= ?!f_{out}(e_1) \\
f_{out}(?{=}e_1) &= ?{=}f_{out}(e_1) \\
f_{in}(?{>}e_1) &= ?{>}f_{in}(e_1) \\
f_{in}(e_1^{*+}) &=
  \begin{cases}
    f_{in}(e_1)  & \text{if } \mathit{null}(e_1) \\
    f_{out}(e_1) & \text{otherwise}
  \end{cases} \\
f_{in}(e_1^{*?}) &=
  \begin{cases}
    f_{in}(e_1)  & \text{if } \mathit{null}(e_1) \\
    f_{out}(e_1) & \text{otherwise}
  \end{cases}
\end{align*}

Figure 8: Definition of functions fout and fin for regex extensions

We do not need to define fin for negative and positive lookaheads, as
pathological uses of these expressions are eliminated by the fout case that
rewrites repetitions with an empty body and the fin cases that rewrite
choices with empty alternatives.
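
The rewriting can be exercised on small examples with the following Python
sketch. It is an illustration under assumptions, not a definitive
implementation: regexes use a hypothetical tuple encoding, the names f_out
and f_in stand for fout and fin, and the cases for concatenation, choice
and plain repetition are assumed to follow the rewriting for standard
regexes. The empty and null predicates are included so that the block is
self-contained.

```python
EPS = ("eps",)
REPS = ("star", "poss", "lazy")

def empty(e):
    tag = e[0]
    if tag in ("eps", "neg", "pos"):
        return True
    if tag == "chr":
        return False
    if tag in ("cat", "alt"):
        return empty(e[1]) and empty(e[2])
    return empty(e[1])              # star, poss, lazy, atomic

def null(e):
    tag = e[0]
    if tag == "chr":
        return False
    if tag == "cat":
        return null(e[1]) and null(e[2])
    if tag == "alt":
        return null(e[1]) or null(e[2])
    if tag == "atomic":
        return null(e[1])
    return True                     # eps, repetitions, lookaheads

def f_out(e):
    tag = e[0]
    if tag in ("eps", "chr"):
        return e
    if tag in ("cat", "alt"):       # assumed plain-regex cases
        return (tag, f_out(e[1]), f_out(e[2]))
    if tag in ("atomic", "neg", "pos"):
        return (tag, f_out(e[1]))
    body = e[1]                     # star, poss or lazy repetition
    if not null(body):
        return (tag, f_out(body))
    if empty(body):
        return EPS
    return (tag, f_in(body))

def f_in(e):
    """Rewrite e, assumed null but not empty, so it no longer matches ε."""
    tag = e[0]
    if tag == "cat":                # both parts are null: treat as a choice
        return f_in(("alt", e[1], e[2]))
    if tag == "alt":
        left, right = e[1], e[2]
        fix = lambda x: f_in(x) if null(x) else f_out(x)
        if empty(left):             # drop an empty alternative
            return fix(right)
        if empty(right):
            return fix(left)
        return ("alt", fix(left), fix(right))
    if tag == "atomic":             # stays atomic
        return ("atomic", f_in(e[1]))
    if tag in REPS:                 # the inner repetition is stripped
        return f_in(e[1]) if null(e[1]) else f_out(e[1])
    raise ValueError(f"f_in is not defined for {tag}")

A, B = ("chr", "a"), ("chr", "b")
print(f_out(("star", ("atomic", ("alt", EPS, A)))))
# ('star', ('atomic', ('chr', 'a')))
print(f_out(("star", ("atomic", ("alt", A, ("alt", EPS, B))))))
# ('star', ('atomic', ('alt', ('chr', 'a'), ('chr', 'b'))))
print(f_out(("star", ("cat", ("pos", A), ("alt", A, EPS)))))
# ('star', ('chr', 'a'))
```

In the last example the lookahead disappears without f_in ever being
applied to it: the concatenation is treated as a choice, and the choice
case discards the lookahead because it is an empty alternative.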

The extensions do not impact the optimizations of Section 4 if we provide
a way of computing a FIRST set for them, as the optimizations do not
depend on the structure of the subexpressions they use. We obviously cannot
apply the repetition optimization on e1*+ e2, e1*? e2, or ?>(e1*) e2, but
applying it on e1* e2 where extensions appear inside e1 or e2 is not a
problem. The repetition optimizations turn repetitions into possessive
repetitions where possible, so not being able to optimize expressions such
as the ones above is not a loss, as they will already exhibit good
backtracking behavior.

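Figure 9 gives an inductive definition for the FIRST sets of extended
regexes. For completeness, we also give cases for the standard regexes.
Because our FIRST sets are defined in terms of the matching relation, they
cannot include ε, so expressions that never consume any prefix of the
subject have empty FIRST sets. The FIRST sets of atomic groupings are
conservative, as they may be a proper superset of the first characters that
the expression actually consumes; for example, FIRST(?>(ε | a)) is {a}
instead of the more precise ∅.

\begin{align*}
\mathit{FIRST}(\varepsilon) &= \emptyset \\
\mathit{FIRST}(a) &= \{a\} \\
\mathit{FIRST}(e_1 e_2) &=
  \begin{cases}
    \mathit{FIRST}(e_1) & \text{if } \neg\mathit{null}(e_1) \\
    \mathit{FIRST}(e_1) \cup \mathit{FIRST}(e_2) & \text{if } \mathit{null}(e_1)
  \end{cases} \\
\mathit{FIRST}(e_1 \mid e_2) &= \mathit{FIRST}(e_1) \cup \mathit{FIRST}(e_2) \\
\mathit{FIRST}(e^{*}) &= \mathit{FIRST}(e) \\
\mathit{FIRST}(?{>}e) &= \mathit{FIRST}(e) \\
\mathit{FIRST}(e^{*+}) &= \mathit{FIRST}(e) \\
\mathit{FIRST}(e^{*?}) &= \mathit{FIRST}(e) \\
\mathit{FIRST}(?!e) &= \emptyset \\
\mathit{FIRST}(?{=}e) &= \emptyset
\end{align*}

Figure 9: Definition of FIRST sets for regexes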
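
The inductive definition of FIRST translates directly into a recursive
function. The following Python sketch assumes a hypothetical tuple
encoding of regexes and includes the null predicate it depends on; the
plain-regex cases of null are assumed to be the usual ones.

```python
def null(e: tuple) -> bool:
    """True if e can match the empty string."""
    tag = e[0]
    if tag == "chr":
        return False
    if tag == "cat":
        return null(e[1]) and null(e[2])
    if tag == "alt":
        return null(e[1]) or null(e[2])
    if tag == "atomic":
        return null(e[1])
    return True  # eps, star, poss, lazy, neg, pos

def first(e: tuple) -> frozenset:
    """Characters that may start a non-empty match of e (never contains ε)."""
    tag = e[0]
    if tag in ("eps", "neg", "pos"):
        return frozenset()            # these never consume input
    if tag == "chr":
        return frozenset({e[1]})
    if tag == "cat":
        head = first(e[1])
        # look past e1 only when e1 can match the empty string
        return head | first(e[2]) if null(e[1]) else head
    if tag == "alt":
        return first(e[1]) | first(e[2])
    if tag in ("star", "atomic", "poss", "lazy"):
        return first(e[1])
    raise ValueError(f"unknown regex node: {tag}")

EPS, A, B = ("eps",), ("chr", "a"), ("chr", "b")
print(sorted(first(("cat", ("poss", A), B))))       # ['a', 'b']
print(sorted(first(("cat", ("neg", A), B))))        # ['b']
print(sorted(first(("atomic", ("alt", EPS, A)))))   # ['a'], the conservative case
```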

7. Conclusion 

We presented a new formalization of regular expressions that uses natural
semantics and a transformation Π that converts a given regular expression
into an equivalent PEG, that is, a PEG that matches the same strings that
the regular expression matches in a Perl-compatible regex implementation.
If the regular expression's language has the prefix property, easily
guaranteed by using an end-of-input marker, the transformation yields a PEG
that recognizes the same language as the regular expression.

We have also shown how our transformation can be easily adapted to
accommodate several ad-hoc extensions used by regex libraries: independent
expressions, possessive and lazy repetition, and lookahead. Our
transformation gives a precise semantics to what were informal extensions
with behavior specified in terms of how backtracking-based regex matchers
are implemented.

We showed that, for some classes of regular expressions, we can produce
PEGs that perform better by reasoning about how a PEG's limited
backtracking and syntactic predicates control the amount of backtracking
that the PEG will perform. The same reasoning that we apply to large classes
of expressions can be applied to specific ones to yield bigger performance
gains where necessary, although our benchmarks show that simple
optimizations are enough to perform close to optimized regex matchers with a
much simpler implementation: both regex engines we used have over ten times
the amount of code of the PEG engine.

Another approach to establishing the correspondence between regular
expressions and PEGs was suggested by Ierusalimschy. In this approach we
convert Deterministic Finite Automata (DFA) into right-linear LL(1)
grammars. Medeiros IJ] proves that an LL(1) grammar has the same language
when interpreted as a CFG and as a PEG. But this approach cannot be used
with regex extensions, as they cannot be expressed by a DFA.

The transformation Π is a formalization of the continuation-based
conversion presented by Oikawa et al. [19]. That work only presents an
informal discussion of the correctness of the conversion, while we proved
our transformation correct with respect to the semantics of regular
expressions and PEGs.



We can also benefit from the LPEG parsing machine [12], IJJJ, a virtual
machine for executing PEGs. We can use the cost model of the parsing 
machine instructions to estimate how efficient a given regular expression or 
regex is. The parsing machine has a simple architecture with just nine basic 
instructions and four registers, and implementations of our transformation 
coupled with implementations of the parsing machine can be the basis for 
simpler implementations of regex libraries. 